Realistic evolutionary models Marjolijn Elsinga & Lars Hemel
Contents: Models with different rates at different sites; Models which allow gaps; Evaluating different models; Break; Probabilistic interpretation of Parsimony; Maximum Likelihood distances
Unrealistic assumptions 1. Same rate of evolution at each site in the substitution matrix - In reality: the structure of proteins and the base pairing of RNA result in different rates at different sites 2. Ungapped alignments - These discard useful information given by the pattern of deletions and insertions
Different rates in matrix Maximum likelihood, sites are independent; sequences $x^j$ for j = 1…n
Different rates in matrix (2) Introduce a site-dependent variable $r_u$
Different rates in matrix (3) We don’t know $r_u$, so we use a prior. Yang [1993] suggests a gamma distribution g(r, α, α), with mean = 1 and variance = 1/α
Problem: the number of terms grows exponentially with the number of sequences, so the computation is slow. Solution: approximation - Replace the integral by a discrete sum - Subdivide the domain into m intervals - Let $r_k$ denote the mean of the gamma distribution in the kth interval
Solution Yang [1993] found that m = 3 or 4 gives a good approximation. This needs only m times as much computation as for non-varying sites
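The discrete approximation can be sketched in a few lines. This is a minimal illustration, not Yang's procedure: it assumes the m intervals are chosen to have equal probability, and it estimates each interval mean $r_k$ by sampling rather than by exact integration. The function names `discrete_gamma_rates` and `site_likelihood` are hypothetical.

```python
import random

def discrete_gamma_rates(alpha, m, n_samples=200_000, seed=0):
    """Estimate the mean rate r_k in each of m equal-probability intervals.

    Assumption: intervals have equal probability mass, and each mean is
    estimated from sorted samples instead of being integrated exactly.
    """
    rng = random.Random(seed)
    # gammavariate takes (shape, scale); scale 1/alpha gives mean 1, variance 1/alpha.
    samples = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    size = n_samples // m  # any leftover samples at the top end are ignored
    return [sum(samples[k * size:(k + 1) * size]) / size for k in range(m)]

def site_likelihood(likelihood_given_rate, rates):
    """Replace the integral over r by an average over the m category rates."""
    return sum(likelihood_given_rate(r) for r in rates) / len(rates)

rates = discrete_gamma_rates(alpha=0.5, m=4)
```

Here `likelihood_given_rate` stands for whatever routine computes the likelihood of one site with all edge lengths scaled by r, which is why the total cost is about m times that of the constant-rate model.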
Evolutionary models with gaps (1) Idea 1: introduce ‘_’ as an extra character of the alphabet of K residues and replace the (KxK) matrix with a (K+1) x (K+1) matrix Drawback: no possibility to assign lower cost to a following gap, gaps are now independent
Evolutionary models with gaps (2) Idea 2: Allison, Wallace & Yee [1992] introduce delete and insertion states to ensure affine-type gaps Drawback: computationally intractable
Evolutionary models with gaps (3) Idea 3: Thorne, Kishino & Felsenstein [1992] use fragment substitution to get a degree of biological plausibility Drawback: usable for only two sequences
Finally Find a way to use affine-type gap penalties in a computationally reasonable way Mitchison & Durbin [1995] made a tree HMM which uses a profile HMM architecture, and treats paths through the model as objects that undergo evolutionary change
Assumptions needed again We will use an architecture much simpler than that of the profile HMM of Krogh et al [1994]: it has only match and delete states. Match state: $M_k$ Delete state: $D_k$ k = position in the model
Tree HMM with gaps (1) Sequence y is ancestor of sequence x Both sequences are aligned to the model, so both follow a prescribed path through the model
Tree HMM with gaps (2) x emits residue $x_i$ at $M_k$; y emits residue $y_j$ at $M_k$. The probability of the substitution $y_j \to x_i$ is $P(x_i \mid y_j, t)$
Tree HMM with gaps (3) What if x follows a different path than y? x: $M_k \to D_{k+1}$ (= MD) y: $M_k \to M_{k+1}$ (= MM) The probability is P(MD|MM, t)
Tree HMM with gaps (4) x: $D_{k+1} \to M_{k+2}$ (= DM) y: $M_{k+1} \to M_{k+2}$ (= MM) We assume that the choice between DD and DM is controlled by a mutational process that operates independently from y
Substitution matrix The probabilities of transitions of the path of x are given by priors: $D_{k+1} \to M_{k+2}$ has probability $q_{DM}$
How it works At position k: $q_{y_j} P(x_i \mid y_j, t)$ Transition k → k+1: $q_{MM} P(MD \mid MM, t)$ Transition k+1 → k+2: $q_{MM} q_{DM}$
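As a worked illustration, the three factors above multiply to give the joint probability of the two paths over positions k to k+2. All numerical values below are hypothetical, chosen only to show the arithmetic; they are not taken from the slides.

```python
import math

# Hypothetical values, for illustration only.
q_y = 0.08             # q_{y_j}: prior probability of residue y_j
p_sub = 0.6            # P(x_i | y_j, t): substitution along the edge
q_mm = 0.9             # q_MM: prior probability of an M -> M transition
p_md_given_mm = 0.05   # P(MD | MM, t): x deletes where y matches
q_dm = 0.7             # q_DM: prior probability of a D -> M transition

factors = [
    q_y * p_sub,            # position k
    q_mm * p_md_given_mm,   # transition k -> k+1
    q_mm * q_dm,            # transition k+1 -> k+2
]

# Summing logs avoids underflow when many positions are multiplied.
log_prob = sum(math.log(f) for f in factors)
joint_prob = math.exp(log_prob)
```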
Another example
Evaluating models: evidence Comparing models is difficult. Compare the probabilities $P(D \mid M_1)$ and $P(D \mid M_2)$, obtained by integrating over all parameters of each model. Parameters: θ. Prior probabilities: P(θ)
Comparing two models A natural way to compare $M_1$ and $M_2$ is to compute the posterior probability of $M_1$
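By Bayes' rule, the posterior probability of $M_1$ is $P(D \mid M_1)P(M_1)$ divided by the sum of that term and the corresponding term for $M_2$. A minimal sketch follows, assuming only two candidate models and working with log evidences for numerical stability; the function name `posterior_m1` is hypothetical.

```python
import math

def posterior_m1(log_evidence_1, log_evidence_2, prior_1=0.5):
    """P(M1 | D) for two candidate models, from log P(D | M1) and log P(D | M2)."""
    a = log_evidence_1 + math.log(prior_1)
    b = log_evidence_2 + math.log(1.0 - prior_1)
    top = max(a, b)  # subtract the maximum before exponentiating
    return math.exp(a - top) / (math.exp(a - top) + math.exp(b - top))
```

With equal priors the result depends only on the ratio of the two evidences, so an evidence ratio of 3 to 1 gives a posterior of 0.75 for $M_1$.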
Parametric Bootstrap Let $\hat{P}_1$ be the maximum likelihood of the data D for the model $M_1$. Let $\hat{P}_2$ be the maximum likelihood of the data D for the model $M_2$
Parametric bootstrap (2) Simulate datasets $D_i$ using the parameter values of $M_1$ that gave the maximum likelihood for D. If Δ exceeds almost all values of $\Delta_i$, then $M_2$ captures aspects of the data that $M_1$ did not mimic, and $M_1$ is rejected
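The procedure can be sketched on a toy problem. Everything in this example is an assumption made for illustration: the data are success counts in two groups, $M_1$ has one shared success probability, $M_2$ has one probability per group, and Δ is taken to be the log ratio of the two maximum likelihoods.

```python
import math
import random

def log_lik(successes, trials, p):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)

def delta(group_a, group_b):
    """Return (log maximum-likelihood ratio of M2 over M1, ML estimate of the shared p)."""
    (sa, na), (sb, nb) = group_a, group_b
    p_shared = (sa + sb) / (na + nb)
    l1 = log_lik(sa, na, p_shared) + log_lik(sb, nb, p_shared)
    l2 = log_lik(sa, na, sa / na) + log_lik(sb, nb, sb / nb)
    return l2 - l1, p_shared

def parametric_bootstrap(group_a, group_b, n_sim=500, seed=0):
    """Return (observed delta, fraction of simulated deltas >= observed)."""
    rng = random.Random(seed)
    observed, p_hat = delta(group_a, group_b)
    na, nb = group_a[1], group_b[1]
    exceed = 0
    for _ in range(n_sim):
        # Simulate a dataset D_i from M1 at its maximum likelihood parameters.
        sim_a = (sum(rng.random() < p_hat for _ in range(na)), na)
        sim_b = (sum(rng.random() < p_hat for _ in range(nb)), nb)
        if delta(sim_a, sim_b)[0] >= observed:
            exceed += 1
    return observed, exceed / n_sim

observed, fraction = parametric_bootstrap((90, 100), (10, 100))
```

A small `fraction` means the observed Δ exceeds almost all simulated $\Delta_i$, which is the condition for rejecting $M_1$.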
Break
Probabilistic interpretation of various models Lars Hemel
Overview Review of last week’s method Parsimony – Assumptions, Properties Probabilistic interpretation of Parsimony Maximum Likelihood distances – Example: Neighbour joining More probabilistic interpretations – Sankoff & Cedergren – Hein’s affine cost algorithm Conclusion / Questions?
Review Parsimony = Finding a tree which can explain the observed sequences with a minimal number of substitutions
Parsimony Remember the following assumptions: – Sequences are aligned – Alignments do not have gaps – Each site is treated independently Furthermore, many families of substitution matrices have these properties: – Multiplicativity: $S(t)S(s) = S(t+s)$ – Reversibility: $P(b \mid a, t)\, q_a = P(a \mid b, t)\, q_b$
Parsimony Basic step: counting the minimal number of changes for one site. The final number of substitutions is the sum over all sites. Weighted parsimony uses different ‘weights’ for different substitutions
Probabilistic interpretation of parsimony Given: a set of substitution probabilities P(b|a) in which we neglect the dependence on length t. Calculate substitution costs S(a,b) = -log P(b|a). Felsenstein [1981] showed that, with these substitution costs, the minimal cost at site u for the whole tree T obtained by the weighted parsimony algorithm can be regarded as an approximation to the negative log likelihood
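The connection can be made concrete with a small weighted parsimony sketch that uses S(a,b) = -log P(b|a) as costs. The two-letter alphabet, the substitution probabilities, and the function name `min_costs` are assumptions made for this example; the tree is a rooted nested tuple of leaf names.

```python
import math

def min_costs(tree, leaf_states, cost, alphabet):
    """Return {a: minimal cost of the subtree when its root carries residue a}."""
    if not isinstance(tree, tuple):  # a leaf: only its observed residue is allowed
        return {a: 0.0 if a == leaf_states[tree] else math.inf for a in alphabet}
    tables = [min_costs(child, leaf_states, cost, alphabet) for child in tree]
    return {
        a: sum(min(cost[a][b] + table[b] for b in alphabet) for table in tables)
        for a in alphabet
    }

# Hypothetical substitution probabilities P(b|a), the same on every edge.
subst_prob = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
cost = {a: {b: -math.log(p) for b, p in row.items()} for a, row in subst_prob.items()}

leaf_states = {"1": "A", "2": "A", "3": "B", "4": "B"}
tree = (("1", "2"), ("3", "4"))
best = min(min_costs(tree, leaf_states, cost, "AB").values())
```

Here exp(-best) is the probability of the single most probable assignment of ancestral residues (ignoring the root prior), whereas the likelihood sums over all assignments. That is the sense in which the minimal cost only approximates the negative log likelihood.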
Probabilistic interpretation of parsimony The performance of tree-building algorithms can be tested by generating sequences probabilistically from a known tree by sampling, and then seeing how often a given algorithm reconstructs that tree correctly. Sampling is done as follows: – Pick a residue a at the root with probability $q_a$ – Accept a substitution to b along the edge down to node i with probability $P(b \mid a, t_i)$, and repeat this down the tree – Sequences of length N are generated by N independent repetitions of this procedure – Maximum likelihood should reconstruct the correct tree for large N
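The sampling steps above can be sketched as follows. The example assumes a two-letter alphabet with a uniform root distribution and a symmetric model in which each edge carries a probability that the residue changes; the tree, its edge values, and the function names are hypothetical.

```python
import random

def sample_site(node, state, rng, out):
    """Propagate one residue from a node down to the leaves below it."""
    if isinstance(node, str):  # a leaf: record the residue
        out[node] = state
        return
    for child, p_change in node:
        child_state = state
        if rng.random() < p_change:  # substitution along this edge
            child_state = "B" if state == "A" else "A"
        sample_site(child, child_state, rng, out)

def sample_sequences(tree, n_sites, seed=0):
    """Generate leaf sequences of length n_sites by independent repetitions."""
    rng = random.Random(seed)
    columns = {}
    for _ in range(n_sites):
        out = {}
        root_state = rng.choice("AB")  # assumed root distribution: q_A = q_B = 1/2
        sample_site(tree, root_state, rng, out)
        for leaf, residue in out.items():
            columns.setdefault(leaf, []).append(residue)
    return {leaf: "".join(residues) for leaf, residues in columns.items()}

# An internal node is a list of (child, change probability on the edge to it).
tree = [
    ([("1", 0.3), ("2", 0.1)], 0.05),
    ([("3", 0.3), ("4", 0.1)], 0.05),
]
sequences = sample_sequences(tree, n_sites=50)
```

Feeding many such simulated datasets to a tree-building method, and counting how often it recovers the generating topology, gives the kind of test described above.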
Probabilistic interpretation of parsimony Suppose we have a tree T with the following edge lengths, and a substitution matrix with p=0.3 for leaves 1,3 and p=0.1 for 2 and
Probabilistic interpretation of parsimony For n leaves there are (2n-5)!! unrooted trees
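The double factorial (2n-5)!! is the product 3 · 5 · … · (2n-5). A short sketch of the count follows; the function name `unrooted_tree_count` is hypothetical.

```python
def unrooted_tree_count(n_leaves):
    """Number of unrooted binary tree topologies on n labelled leaves: (2n-5)!!"""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n_leaves - 5 + 1, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

counts = [unrooted_tree_count(n) for n in (4, 5, 10)]  # [3, 15, 2027025]
```

Four leaves give only 3 topologies, but ten leaves already give over two million, which is why exhaustive search over trees quickly becomes infeasible.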
Probabilistic interpretation of parsimony Parsimony can construct the wrong tree even for large N [Figure: results as a function of N; panels: Parsimony, Maximum likelihood]
Probabilistic interpretation of parsimony Suppose the following example: a tree with A, A, B, B at leaves 1, 2, 3 and 4
Probabilistic interpretation of parsimony With parsimony the number of substitutions is counted [Figure: two trees with leaves A, A, B, B; the left tree needs 2 substitutions, the right tree needs 1] Parsimony constructs the right tree with 1 substitution more often than the left tree with 2
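The counts of 1 and 2 can be reproduced with the standard set-based counting step for unweighted parsimony (Fitch's algorithm), sketched here for one site. Which leaves are paired in each tree of the figure is an assumption: the example compares pairing leaf 1 with leaf 2 against pairing leaf 1 with leaf 3.

```python
def fitch(tree, leaf_states):
    """Return (candidate residue set at the root, minimal number of substitutions)."""
    if not isinstance(tree, tuple):  # a leaf
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_states)
    right_set, right_cost = fitch(tree[1], leaf_states)
    common = left_set & right_set
    if common:  # the children agree on some residue: no extra substitution
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

leaf_states = {"1": "A", "2": "A", "3": "B", "4": "B"}

# Each unrooted four-leaf tree is rooted on its internal edge for counting.
cost_12_34 = fitch((("1", "2"), ("3", "4")), leaf_states)[1]  # 1 substitution
cost_13_24 = fitch((("1", "3"), ("2", "4")), leaf_states)[1]  # 2 substitutions
```

Sites with the pattern A, A, B, B therefore favour the tree that groups leaves 1 and 2, whether or not that grouping reflects the true history.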
Maximum Likelihood distances Suppose we have a tree T, edge lengths, and sampled sequences at the leaves. We’ll try to compute the distance between two leaf sequences $x^i$ and $x^j$
Maximum Likelihood distances By multiplicativity
By reversibility and multiplicativity
Maximum Likelihood distances
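The equations on these slides did not survive extraction. The following is a sketch of the standard argument under the two stated properties, not a reproduction of the original slides. It assumes leaves i and j hang from a common ancestor by paths of total length $t_1$ and $t_2$ (the edges along each path having been merged by multiplicativity), that u indexes sites, and that $q_a$ is the equilibrium distribution.

```latex
\begin{align*}
P(x^i, x^j \mid t_1, t_2)
  &= \prod_u \sum_a q_a \, P(x^i_u \mid a, t_1) \, P(x^j_u \mid a, t_2) \\
  &= \prod_u \sum_a q_{x^i_u} \, P(a \mid x^i_u, t_1) \, P(x^j_u \mid a, t_2)
     && \text{(reversibility)} \\
  &= \prod_u q_{x^i_u} \, P(x^j_u \mid x^i_u, t_1 + t_2)
     && \text{(multiplicativity)}
\end{align*}
```

The likelihood of the pair thus depends only on the total path length $t_1 + t_2$, and the maximum likelihood distance is the value of that sum which maximizes the last expression.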
ML distances between leaf sequences are close to additive, given a large amount of data
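A maximum likelihood distance can be computed numerically for a concrete model. The sketch below assumes a two-letter alphabet with a symmetric model in which two residues separated by time t differ with probability (1 - exp(-2t))/2; this model is multiplicative and reversible, but it is chosen here only for illustration. The function names are hypothetical.

```python
import math

def p_diff(t):
    """Probability that two residues separated by time t differ (assumed model)."""
    return 0.5 * (1.0 - math.exp(-2.0 * t))

def log_lik(x, y, t):
    """log of prod_u q_{x_u} P(y_u | x_u, t), with q = 1/2 for both residues."""
    k = sum(a != b for a, b in zip(x, y))
    n = len(x)
    return n * math.log(0.5) + k * math.log(p_diff(t)) + (n - k) * math.log(1.0 - p_diff(t))

def ml_distance(x, y, t_max=10.0, steps=200):
    """Maximize the log-likelihood over t by ternary search (it has a single peak)."""
    lo, hi = 1e-9, t_max
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_lik(x, y, m1) < log_lik(x, y, m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

x = "A" * 100
y = "B" * 10 + "A" * 90
d = ml_distance(x, y)
```

For this particular model the maximum can also be written in closed form as -0.5 · log(1 - 2f), where f is the fraction of differing sites, which gives a check on the numerical search.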
Example: Neighbour joining [Figure: tree with nodes labelled i, j, k, m]
Use Maximum Likelihood distances Suppose we have a multiplicative reversible model, plenty of data, and an underlying probabilistic model that is correct. Then neighbour joining will construct any tree correctly.
Example: Neighbour joining Neighbour joining using ML distances constructs the correct tree where parsimony failed [Figure: results as a function of N]
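A compact neighbour joining sketch illustrates the claim. It assumes exactly additive distances, standing in for ML distances in the limit of plenty of data. The edge lengths are hypothetical: 1.0 on the edges to leaves 1 and 3, and 0.1 on the edges to leaves 2 and 4 and on the internal edge, with true topology ((1,2),(3,4)). Only the topology is returned, not the edge lengths.

```python
def neighbour_joining(names, dist):
    """dist maps frozenset({a, b}) to a distance; returns a nested-tuple topology."""
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # r[a]: summed distance from a to all other nodes, divided by n - 2.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) / (n - 2) for a in nodes}
        a, b = min(
            ((p, q) for i, p in enumerate(nodes) for q in nodes[i + 1:]),
            key=lambda pair: d[frozenset(pair)] - r[pair[0]] - r[pair[1]],
        )
        new = (a, b)
        for c in nodes:
            if c not in (a, b):  # distance from the new parent node to c
                d[frozenset((new, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d[frozenset((a, b))]
                )
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return tuple(nodes)

def leaf_sets(tree, found):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, found) for child in tree))
    found.append(leaves)
    return leaves

dist = {
    frozenset(("1", "2")): 1.1, frozenset(("3", "4")): 1.1,
    frozenset(("1", "3")): 2.1, frozenset(("1", "4")): 1.2,
    frozenset(("2", "3")): 1.2, frozenset(("2", "4")): 0.3,
}
topology = neighbour_joining(["1", "2", "3", "4"], dist)
```

Although leaves 2 and 4 are by far the closest pair, the correction terms stop neighbour joining from grouping them, so the returned topology separates {1, 2} from {3, 4}.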
More probabilistic interpretations Sankoff & Cedergren – Simultaneously aligns the sequences and finds their phylogeny, using a character substitution model – Probabilistic when scores are interpreted as log probabilities and the procedure sums instead of maximizing (Allison, Wallace & Yee [1992]) – But, like the original S&C method, it is not practical for most problems.
More probabilistic interpretations Hein’s affine cost algorithm – Simultaneously aligns the sequences and finds their phylogeny, using affine gap penalties – Probabilistic when scores are interpreted as log probabilities and the procedure sums instead of maximizing – But when using plus instead of max we have to include all the paths, and the cost of doing so grows at the first node above the leaves, again at the next node, and so on. So all the speed advantages are gone.
Conclusion Probabilistic interpretations can be better – Compare ML with parsimony. They can also be less useful, because their computational costs get too high – Sankoff & Cedergren. Neighbour joining constructs the correct tree if its assumptions hold. So, the trick is to know your problem and to decide which method is the best. Questions?