Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.


1 Realistic evolutionary models Marjolijn Elsinga & Lars Hemel

2 Realistic evolutionary models Contents Models with different rates at different sites Models which allow gaps Evaluating different models Break Probabilistic interpretation of Parsimony Maximum Likelihood distances

3 Unrealistic assumptions 1 Same rate of evolution at each site in the substitution matrix - In reality: the structure of proteins and the base pairing of RNA result in different rates 2 Ungapped alignments - Discard useful information given by the pattern of deletions and insertions

4 Different rates in matrix Maximum likelihood, sites are independent X j for j = 1…n

5 Different rates in matrix (2) Introduce a site-dependent variable r u

6 Different rates in matrix (3) We don’t know r u, so we use a prior Yang [1993] suggests a gamma distribution g(r, α, α), with mean = 1 and variance = 1/α

7 Problem Number of terms grows exponentially with the number of sequences → computationally slow Solution: approximation - Replace integral by a discrete sum - Subdivide domain into m intervals - Let r k denote the mean of the gamma distribution in the kth interval

8 Solution Yang [1993] found that even m = 3 or 4 gives a good approximation Only m times as much computation as for non-varying sites
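The discretisation on slides 7 and 8 can be sketched as follows. This is a minimal sketch, not code from the presentation: it assumes the m intervals are chosen to have equal probability 1/m under the gamma prior g(r, α, α), which the slides do not state, and the helper names (`reg_lower_gamma`, `quantile`, `discrete_gamma_rates`) are made up for illustration. Each r k is the mean of the gamma distribution within the kth interval.

```python
import math

def reg_lower_gamma(a, x, terms=500):
    """Regularized lower incomplete gamma function P(a, x), via its power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
        if term < total * 1e-15:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gamma_cdf(r, alpha):
    """CDF of the gamma prior g(r, alpha, alpha), which has mean 1."""
    return reg_lower_gamma(alpha, alpha * r)

def quantile(p, alpha):
    """Rate r with gamma_cdf(r) = p, found by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, alpha) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid, alpha) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def discrete_gamma_rates(alpha, m):
    """Mean rate r_k in each of m equal-probability intervals (an assumption)."""
    cuts = [0.0] + [quantile(k / m, alpha) for k in range(1, m)] + [math.inf]
    rates = []
    for lo, hi in zip(cuts, cuts[1:]):
        # The integral of r * g(r) over [lo, hi] equals P(alpha+1, alpha*hi) - P(alpha+1, alpha*lo).
        upper = 1.0 if math.isinf(hi) else reg_lower_gamma(alpha + 1, alpha * hi)
        lower = reg_lower_gamma(alpha + 1, alpha * lo)
        rates.append(m * (upper - lower))
    return rates

rates = discrete_gamma_rates(alpha=0.5, m=4)
```

With these rates, the likelihood of a site would be the average of the m per-rate likelihoods, which is where the "m times as much computation" comes from.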

9 Evolutionary models with gaps (1) Idea 1: introduce ‘_’ as an extra character of the alphabet of K residues and replace the (KxK) matrix with a (K+1) x (K+1) matrix Drawback: no possibility to assign lower cost to a following gap, gaps are now independent

10 Evolutionary models with gaps (2) Idea 2: Allison, Wallace & Yee [1992] introduce delete and insertion states to ensure affine-type gaps Drawback: computationally intractable

11 Evolutionary models with gaps (3) Idea 3: Thorne, Kishino & Felsenstein [1992] use fragment substitution to get a degree of biological plausibility Drawback: usable for only two sequences

12 Finally Find a way to use affine-type gap penalties in a computationally reasonable way Mitchison & Durbin [1995] made a tree HMM which uses a profile HMM architecture, and treats paths through the model as objects that undergo evolutionary change

13 Assumptions needed again We will use an architecture considerably simpler than that of the profile HMM of Krogh et al [1994]: it has only match and delete states Match state: M k Delete state: D k k = position in the model

14 Tree HMM with gaps (1) Sequence y is ancestor of sequence x Both sequences are aligned to the model, so both follow a prescribed path through the model

15 Tree HMM with gaps (2) x emits residue x i at M k y emits residue y j at M k Probability of substitution y j → x i is P(x i | y j,t)

16 Tree HMM with gaps (3) What if x takes a different path than y? x: M k → D k+1 (= MD) y: M k → M k+1 (= MM) The probability of this change is P(MD|MM, t)

17 Tree HMM with gaps (4) x: D k+1 → M k+2 (= DM) y: M k+1 → M k+2 (= MM) We assume that the choice between DD and DM is controlled by a mutational process that operates independently from y

18 Substitution matrix The probabilities of transitions of the path of x are given by priors: D k+1 → M k+2 has probability q DM

19 How it works At position k: q yj P(x i |y j,t) Transition k → k+1: q MM P(MD|MM,t) Transition k+1 → k+2: q MM q DM
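The three factors on slide 19 multiply into one joint probability for the pair of paths. The sketch below only restates that product; the function name and all numeric values are hypothetical placeholders, not numbers from the presentation.

```python
def path_pair_probability(q_residue, p_subst, q_mm, p_md_given_mm, q_dm):
    """Multiply the slide's factors for positions k, k+1 and k+2."""
    at_k = q_residue * p_subst        # q_yj * P(x_i | y_j, t): y emits y_j, x substitutes to x_i
    k_to_k1 = q_mm * p_md_given_mm    # q_MM * P(MD | MM, t): y goes MM, x changes to MD
    k1_to_k2 = q_mm * q_dm            # q_MM * q_DM: y goes MM, x leaves the delete state by prior
    return at_k * k_to_k1 * k1_to_k2

# Hypothetical values, chosen only to make the example runnable.
prob = path_pair_probability(q_residue=0.25, p_subst=0.1,
                             q_mm=0.9, p_md_given_mm=0.05, q_dm=0.6)
```

The last factor uses the prior q DM rather than a conditional on y because, as slide 17 assumes, the choice between DD and DM is independent of y.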

20 Another example

21 Evaluating models: evidence Comparing models is difficult Compare probabilities: P(D|M 1 ) and P(D|M 2 ) by integrating over all parameters of each model Parameters θ Prior probabilities P(θ )

22 Comparing two models Natural way to compare M 1 and M 2 is to compute the posterior probability of M 1
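By Bayes' rule, the posterior of M 1 is P(D|M 1 )P(M 1 ) divided by the sum of P(D|M k )P(M k ) over both models. A small sketch of that computation follows; working with log evidences and the function name `posterior_m1` are choices made here for illustration, not something the slides specify.

```python
import math

def posterior_m1(log_evidence_1, log_evidence_2, prior_1=0.5):
    """Posterior P(M1 | D) from the log evidences log P(D | M1) and log P(D | M2)."""
    prior_2 = 1.0 - prior_1
    a = log_evidence_1 + math.log(prior_1)
    b = log_evidence_2 + math.log(prior_2)
    top = max(a, b)  # subtract the maximum so the exponentials do not underflow
    return math.exp(a - top) / (math.exp(a - top) + math.exp(b - top))

# Hypothetical log evidences: M2 explains the data three times better than M1.
example = posterior_m1(-100.0, -100.0 + math.log(3.0))
```

The hard part in practice is the evidence itself, which requires the integral over all parameters θ mentioned on slide 21.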

23 Parametric Bootstrap Compute the maximum likelihood of the data D for the model M 1 and the maximum likelihood of the data D for the model M 2 The test statistic Δ compares the two, typically as the difference of their logarithms

24 Parametric bootstrap (2) Simulate datasets D i with the values of the parameters of M 1 that gave the maximum likelihood for D, and compute Δ i for each If Δ exceeds almost all values of Δ i → M 2 captures aspects of the data that M 1 did not mimic, therefore M 1 is rejected
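The procedure can be illustrated on a toy problem. Everything in this sketch is an assumption made for illustration: the data are 0/1 observations, M 1 has one shared rate, M 2 has a separate rate for each half of the data, and Δ is taken to be the difference of the maximized log likelihoods.

```python
import math
import random

def xlogp(count, p):
    """count * log(p), with the convention 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(p)

def max_loglik_one_rate(data):
    """Toy M1: one Bernoulli rate for all sites. Returns (max log likelihood, fitted rate)."""
    ones = sum(data)
    p = ones / len(data)
    return xlogp(ones, p) + xlogp(len(data) - ones, 1.0 - p), p

def max_loglik_two_rates(data):
    """Toy M2: a separate Bernoulli rate for each half of the data."""
    half = len(data) // 2
    return max_loglik_one_rate(data[:half])[0] + max_loglik_one_rate(data[half:])[0]

def delta(data):
    """Difference of maximized log likelihoods, M2 minus M1."""
    return max_loglik_two_rates(data) - max_loglik_one_rate(data)[0]

def parametric_bootstrap(data, n_sim=200, seed=0):
    """Return (observed delta, fraction of simulated deltas at least as large)."""
    rng = random.Random(seed)
    _, p_hat = max_loglik_one_rate(data)   # maximum likelihood parameter of M1
    observed = delta(data)
    simulated = []
    for _ in range(n_sim):
        fake = [1 if rng.random() < p_hat else 0 for _ in data]  # dataset D_i drawn from M1
        simulated.append(delta(fake))
    fraction = sum(d >= observed for d in simulated) / n_sim
    return observed, fraction

# Hypothetical data whose two halves clearly differ.
data = [1] * 45 + [0] * 5 + [1] * 5 + [0] * 45
observed, fraction = parametric_bootstrap(data)
```

A small `fraction` means the observed Δ exceeds almost all Δ i, which is the condition under which the slide rejects M 1.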

25 Break

26 Probabilistic interpretation of various models Lars Hemel

27 Overview Review of last week’s method Parsimony – Assumptions, Properties Probabilistic interpretation of Parsimony Maximum Likelihood distances – Example: Neighbour joining More probabilistic interpretations – Sankoff & Cedergren – Hein’s affine cost algorithm Conclusion / Questions?

28 Review Parsimony = Finding a tree which can explain the observed sequences with a minimal number of substitutions

29 Parsimony Remember the following assumptions: – Sequences are aligned – Alignments do not have gaps – Each site is treated independently Furthermore, many model families have: – A multiplicative substitution matrix: P(b|a,s+t) = Σ_c P(c|a,s) P(b|c,t) – Reversibility: q_a P(b|a,t) = q_b P(a|b,t)

30 Parsimony Basic step: counting the minimal number of changes for one site The final number of substitutions is obtained by summing over all the sites Weighted parsimony uses different ‘weights’ for different substitutions
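The basic step can be sketched with Fitch's algorithm, a standard way to count the minimal number of changes at one site; the slides do not name the algorithm, so treat that choice as an assumption. Trees are written here as nested pairs with sequences (strings) at the leaves, a representation picked only for this example.

```python
def site_cost(tree, u):
    """Minimal number of changes at site u for a rooted binary tree of nested pairs."""
    def fitch(node):
        if isinstance(node, str):          # leaf: the observed residue at site u
            return {node[u]}, 0
        left, right = node
        left_set, left_cost = fitch(left)
        right_set, right_cost = fitch(right)
        common = left_set & right_set
        if common:                         # children can agree: no extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return fitch(tree)[1]

def parsimony_cost(tree, length):
    """Sum the per-site minimal counts over all sites."""
    return sum(site_cost(tree, u) for u in range(length))

# Leaves A, A, B, B grouped in two different ways.
grouped = site_cost((("A", "A"), ("B", "B")), 0)
mixed = site_cost((("A", "B"), ("A", "B")), 0)
```

Grouping the two A leaves together needs fewer changes than mixing them, so parsimony prefers that tree.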

31 Probabilistic interpretation of parsimony Given: A set of substitution probabilities P(b|a) in which we neglect the dependence on length t Calculate substitution costs S(a,b) = -log P(b|a) Felsenstein [1981] showed that, with these substitution costs, the minimal cost at site u for the whole tree T obtained by the weighted parsimony algorithm can be regarded as an approximation to the (negative log) likelihood
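A minimal sketch of weighted parsimony with costs S(a,b) = -log P(b|a) follows. It uses the usual dynamic programme over the tree (often attributed to Sankoff); the two-letter alphabet, the probabilities 0.9 and 0.1, and the nested-tuple tree format are hypothetical choices for illustration, and no prior is applied at the root.

```python
import math

def weighted_parsimony(tree, alphabet, cost):
    """Minimal total substitution cost for one site; leaves are single residues."""
    def best(node):
        if isinstance(node, str):
            # A leaf is fixed to its observed residue.
            return {a: (0.0 if a == node else math.inf) for a in alphabet}
        tables = [best(child) for child in node]
        # For each residue a at this node, every child picks its cheapest residue b.
        return {a: sum(min(cost(a, b) + table[b] for b in alphabet) for table in tables)
                for a in alphabet}
    return min(best(tree).values())

def neg_log_cost(a, b):
    """S(a, b) = -log P(b | a) for a hypothetical symmetric model."""
    return -math.log(0.9 if a == b else 0.1)

min_cost = weighted_parsimony((("A", "A"), ("B", "B")), "AB", neg_log_cost)
```

Because the costs are negative log probabilities, the minimal cost corresponds to the single most probable assignment of ancestral residues, which is why it only approximates the full likelihood (a sum over all assignments).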

32 Probabilistic interpretation of parsimony The performance of tree-building algorithms can be tested by generating data from a known tree by sampling and then seeing how often a given algorithm reconstructs the tree correctly Sampling is done as follows: – Pick a residue a at the root with probability q_a – Accept substitution to b along the edge down to node i with probability P(b|a,t_i), and repeat this down the tree to the leaves – Sequences of length N are generated by N independent repetitions of this procedure – Maximum likelihood should reconstruct the correct tree for large N
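The sampling steps above can be sketched as follows. The sketch assumes a two-letter alphabet, a uniform root distribution, and a single change probability p per edge; the tree format (a list of `(child, p)` pairs, with leaf names as strings) is invented for this example.

```python
import random

def evolve(node, residue, rng, out):
    """Pass a residue down the tree, flipping it on each edge with that edge's probability."""
    if isinstance(node, str):              # leaf: record the residue under the leaf's name
        out[node] = residue
        return
    for child, p in node:
        flipped = "B" if residue == "A" else "A"
        child_residue = flipped if rng.random() < p else residue
        evolve(child, child_residue, rng, out)

def sample_sequences(tree, n_sites, seed=0):
    """Generate leaf sequences of length n_sites by independent repetitions per site."""
    rng = random.Random(seed)
    sequences = {}
    for _ in range(n_sites):
        site = {}
        root_residue = rng.choice("AB")    # uniform root distribution (assumption)
        evolve(tree, root_residue, rng, site)
        for leaf, residue in site.items():
            sequences.setdefault(leaf, []).append(residue)
    return {leaf: "".join(residues) for leaf, residues in sequences.items()}

# Hypothetical four-leaf tree: long edges to leaves 1 and 3, short edges to 2 and 4.
tree = [("1", 0.3), ("2", 0.1), ([("3", 0.3), ("4", 0.1)], 0.09)]
sequences = sample_sequences(tree, n_sites=100)
```

Feeding such simulated sequences to a tree-building method many times and counting how often the generating tree comes back is the test the slide describes.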

33 Probabilistic interpretation of parsimony Suppose we have a tree T on leaves 1, 2, 3 and 4 with edge lengths 0.09, 0.1 and 0.3, and a substitution matrix with p = 0.3 for leaves 1 and 3 and p = 0.1 for leaves 2 and 4 [Figure: tree T with leaves 1, 2, 4, 3]

34 Probabilistic interpretation of parsimony There are (2n-5)!! unrooted trees with n leaves [Figure: the three unrooted trees on leaves 1, 2, 3, 4]
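The double factorial (2n-5)!! is the product of the odd numbers from 2n-5 down to 1. A short sketch of the count, with a function name chosen here for illustration:

```python
def count_unrooted_trees(n):
    """Number of unrooted binary trees with n labelled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(2 * n - 5, 1, -2):   # 2n-5, 2n-7, ..., 3
        count *= k
    return count

counts = {n: count_unrooted_trees(n) for n in (4, 5, 10)}
```

For four leaves this gives the three trees considered in the example, and the count grows very quickly with n.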

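35 Probabilistic interpretation of parsimony Parsimony can construct the wrong tree even for large N
Counts for the three candidate trees at each sequence length N:
Parsimony
N      tree 1   tree 2   tree 3
20     396      378      224
100    405      515      79
500    404      594      2
2000   353      646      0
Maximum likelihood
N      tree 1   tree 2   tree 3
20     419      339      242
100    638      204      158
500    904      61       35
2000   997      3        0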

36 Probabilistic interpretation of parsimony Suppose the following example: A tree with A,A,B,B at the places 1,2,3 and 4 A A B B

37 Probabilistic interpretation of parsimony With parsimony, the number of substitutions is calculated for each tree [Figure: two candidate trees for the leaves A, A, B, B; the left tree needs 2 substitutions, the right tree needs 1] Parsimony constructs the right tree with 1 substitution more often than the left tree with 2

38 Maximum Likelihood distances Suppose we have a tree T with given edge lengths and sampled sequences at the leaves We'll try to compute the distance between two of the leaf sequences

39 Maximum Likelihood distances By multiplicativity

40 By reversibility and multiplicativity

41 Maximum Likelihood distances

42 ML distances between leaf sequences are close to additive, given large amount of data
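As a concrete instance of an ML distance, the sketch below uses the Jukes-Cantor model, for which the maximum likelihood distance has a well-known closed form. The slides do not say which substitution model they use, so Jukes-Cantor is an assumption made here only to keep the example short.

```python
import math

def jc_ml_distance(x, y):
    """ML distance (expected substitutions per site) under Jukes-Cantor, an assumed model."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(a != b for a, b in zip(x, y)) / len(x)   # observed fraction of differing sites
    if p >= 0.75:
        return math.inf                              # saturated: the estimate diverges
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

d = jc_ml_distance("ACGT", "ACGA")   # one difference in four sites
```

With enough data these pairwise estimates approach the sum of the edge lengths on the path between the two leaves, which is the sense in which ML distances become close to additive.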

43 Example: Neighbour joining i j k m

44 Use Maximum Likelihood distances Suppose we have a multiplicative reversible model Suppose we have plenty of data The underlying probabilistic model is correct Then Neighbour joining will construct any tree correctly.
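The key step of neighbour joining is choosing which pair of leaves to join. A sketch of the standard selection criterion follows; the distance matrix is hypothetical, built as exactly additive path lengths on a four-leaf tree with leaf edges 0.3, 0.1, 0.3, 0.1 and an internal edge of 0.09, where leaves 0 and 1 are true neighbours.

```python
def nj_pick_pair(d):
    """Return the pair (i, j) minimising (n - 2) * d[i][j] - r_i - r_j."""
    n = len(d)
    r = [sum(row) for row in d]              # total distance from each leaf to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best[1], best[2]

# Hypothetical additive distances: true neighbour pairs are (0, 1) and (2, 3).
d = [
    [0.00, 0.40, 0.69, 0.49],
    [0.40, 0.00, 0.49, 0.29],
    [0.69, 0.49, 0.00, 0.40],
    [0.49, 0.29, 0.40, 0.00],
]
pair = nj_pick_pair(d)
```

Note that the smallest raw distance in this matrix is between leaves 1 and 3, which are not neighbours; the correction by the row sums is what lets neighbour joining still pick a true neighbour pair when the distances are additive.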

45 Example: Neighbour joining Neighbour joining using ML distances It constructs the correct tree where Parsimony failed
Counts for the three candidate trees at each sequence length N:
N      tree 1   tree 2   tree 3
20     477      301      222
100    635      231      134
500    896      85       19
2000   997      5        0

46 More probabilistic interpretations Sankoff & Cedergren – Simultaneously aligning sequences and finding their phylogeny, by using a character substitution model – Probabilistic when scores are interpreted as log probabilities and the procedure sums instead of maximizing (Allison, Wallace & Yee [1992]) – But, like the original S&C method, it is not practical for most problems.

47 More probabilistic interpretations Hein’s affine cost algorithm – Simultaneously aligning sequences and finding their phylogeny, by using affine gap penalties – Probabilistic when scores are interpreted as log probabilities and the procedure sums instead of maximizing. – But when using plus instead of max we have to include all the paths, which incurs a cost at the first node above the leaves, again at the next, and so on. So all the speed advantages are gone.

48 Conclusion Probabilistic interpretations can be better – Compare ML with parsimony They can also be less useful, because their computational costs get too high – Sankoff & Cedergren Neighbour joining constructs the correct tree if its assumptions hold So, the trick is to know your problem and to decide which method is the best Questions??

