1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel.

1 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel

2 2 Preview: The Phylogenetic Reconstruction Problem

3 3 Evolution is modeled by a tree. (All our sequences are DNA sequences, consisting of {A,G,C,T}.) Sequences at the tree vertices: AATCCTG ATAGCTG AATGGGC GAACGTA AAACCGA ACGGTCA ACGGATA ACGGGTA ACCCGTG ACCGTTG TCTGGTA TCTGGGA TCCGGAA AGCCGTG GGGGATT AAAGTCA AAAGGCG AAACACA AAAGCTG

4 4 Phylogenetic Reconstruction. Input sequences: AATCCTG ATAGCTG AATGGGC GAACGTA AAACCGA ACCGTTG TCTGGGA TCCGGAA AGCCGTG GGGGATT

5 5 Phylogenetic Reconstruction. B : AATCCTG, C : ATAGCTG, A : AATGGGC, D : GAACGTA, E : AAACCGA, J : ACCGTTG, G : TCTGGGA, H : TCCGGAA, I : AGCCGTG, F : GGGGATT. Goal: reconstruct the true tree as accurately as possible. (Figure: the true rooted tree over taxa A to J and its reconstruction.)

6 6 What Makes Reconstruction Tough? Assume a stochastic model (e.g. Kimura 2 Parameter). - Misspecified model of evolution. - Short and deep edges vs. limited sequence length: weak signals (short edges), decaying signals (deep edges). Example sequences: ACGGATA ACGGGTA ACCGATG

7 7 Road Map: Distance based reconstruction algorithms; The Kimura 2 Parameter (K2P) Model; Performance of distance methods in the K2P model; Substitution models and substitution rate functions; Properties of SR functions; Unified Substitution Models; Optimizing Distances in the K2P model; Simulation results

8 8 Distance Based Phylogenetic Reconstruction: Exact vs. Noisy distances. Exact (additive) distances between species correspond to the edge-weighted true tree. Distance estimation using finite sampling yields estimated distances, from which the reconstructed tree is built. Challenge: minimize the effect of the noise introduced by the sampling.

9 9 Road Map: Distance based reconstruction algorithms; The Kimura 2 Parameter (K2P) Model; Performance of known distance methods in the K2P model; Substitution models and substitution rate functions; Properties of SR functions; Unified Substitution Models; Optimizing Distances in the K2P model; Simulation results

10 10 The Kimura 2 Parameter (K2P) model [Kimura80]: each edge (u, v) corresponds to a rate matrix. (Figure: the K2P generic rate matrix, with its transition and transversion entries marked.)

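11 11 K2P standard distance: $\Delta_{total}$ = total substitution rate. The total substitution rate of a K2P rate matrix $R$ is $\Delta_{total}(R) = \alpha + 2\beta$. This is the expected number of mutations per site. It is an additive distance: along a path u, v, w with rates $(\alpha, \beta)$ and $(\alpha', \beta')$ the total is $(\alpha + \alpha') + 2(\beta + \beta')$.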
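The additivity of the total rate can be checked numerically. The sketch below is only an illustration: the names `k2p_rate_matrix` and `total_rate`, the nucleotide order A, G, C, T, and the numeric rates are assumptions made for this example. It also assumes that elapsed time is absorbed into the rates, so that composing two edges corresponds to adding their (commuting) K2P rate matrices.

```python
def k2p_rate_matrix(alpha, beta):
    """K2P rate matrix over nucleotides ordered A, G, C, T (assumed order)."""
    order = "AGCT"
    purines = {"A", "G"}
    rows = []
    for x in order:
        row = []
        for y in order:
            if x == y:
                row.append(-(alpha + 2 * beta))   # rows of a rate matrix sum to 0
            elif (x in purines) == (y in purines):
                row.append(alpha)                 # transition (A<->G, C<->T)
            else:
                row.append(beta)                  # transversion
        rows.append(row)
    return rows


def total_rate(rate_matrix):
    """Expected substitutions per site under the uniform base distribution."""
    return -sum(rate_matrix[i][i] for i in range(4)) / 4


# Hypothetical rates on two consecutive edges u-v and v-w.
r_uv = k2p_rate_matrix(0.3, 0.1)
r_vw = k2p_rate_matrix(0.2, 0.05)
r_uw = [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(r_uv, r_vw)]

print(total_rate(r_uv), total_rate(r_vw), total_rate(r_uw))
```

The third printed value is the sum of the first two, up to floating-point rounding.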

12 12 Estimation of $\Delta_{total}(R_{uv}) = d_{K2P}(u,v)$ is a noisy stochastic process: the aligned sequences (u: AACA…GTCTTCGAGGCCC, v: AGCA…GCCTATGCGACCT) are fed to the K2P total rate distance correction procedure.
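A minimal sketch of such a correction procedure is given below, using Kimura's two-parameter formula on observed transition and transversion fractions. The function name `k2p_distance` and the toy sequences are hypothetical; the mapping of the observed fractions to the eigenvalues (λ estimated by 1 - 2q and μ by 1 - 2p - q, where p and q are the transition and transversion fractions) is an assumption consistent with the definitions of λ_P and μ_P used later in the talk.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq_u, seq_v):
    """Estimate the K2P total-rate distance between two aligned sequences."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for x, y in zip(seq_u, seq_v):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1       # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1     # purine<->pyrimidine
    n = len(seq_u)
    p = transitions / n            # observed transition fraction
    q = transversions / n          # observed transversion fraction
    lam = 1 - 2 * q                # estimate of lambda
    mu = 1 - 2 * p - q             # estimate of mu
    if lam <= 0 or mu <= 0:
        return math.inf            # saturated: the correction is undefined
    return -(math.log(lam) + 2 * math.log(mu)) / 4


# Toy alignment (hypothetical): one transition in ten sites.
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

With few sites the estimate is dominated by sampling noise, which is the effect the rest of the talk tries to control.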

13 13 Road Map: Distance based reconstruction algorithms; The Kimura 2 Parameter (K2P) Model; Performance of distance methods in the K2P model; Substitution models and substitution rate functions; Properties of SR functions; Unified Substitution Models; Optimizing Distances in the K2P model; Simulation results

14 14 Check performance of K2P standard distances in resolving quartet splits. There are 3 possible quartet topologies: AB|CD, AC|BD, AD|BC. Distance methods reconstruct the true split by the 4-point condition. (Figure label: $w_{sep}$.) The 4-point condition for noisy distances is:
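The four-point rule for noisy distances can be sketched as follows: among the three pairings of the four taxa, choose the one whose sum of within-pair distances is smallest. This is the standard four-point method; the function name `resolve_quartet` and the distance values are assumptions made for illustration.

```python
def resolve_quartet(d):
    """Pick the split with the smallest pairwise sum.

    d maps frozenset({x, y}) to the distance between taxa x and y,
    for the four taxa 'A', 'B', 'C', 'D'.
    """
    def dist(x, y):
        return d[frozenset((x, y))]

    sums = {
        "AB|CD": dist("A", "B") + dist("C", "D"),
        "AC|BD": dist("A", "C") + dist("B", "D"),
        "AD|BC": dist("A", "D") + dist("B", "C"),
    }
    return min(sums, key=sums.get)


# Exact additive distances for a hypothetical tree: four external edges of
# length 1 and a separating edge of length 2 between {A, B} and {C, D}.
additive = {frozenset(pair): value for pair, value in [
    (("A", "B"), 2.0), (("C", "D"), 2.0),
    (("A", "C"), 4.0), (("A", "D"), 4.0),
    (("B", "C"), 4.0), (("B", "D"), 4.0),
]}
print(resolve_quartet(additive))  # AB|CD
```

For exact additive distances the two larger sums are equal and exceed the smallest by twice the weight of the separating edge; with estimated distances the rule fails when noise closes that gap.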

15 15 We evaluate the accuracy of the K2P distance estimation by the Split Resolution Test. (Figure: a rooted quartet over taxa A, B, C, D.) t is evolutionary time. The diameter of the quartet is 22t.

16 16 Phase A: simulate evolution along the quartet over A, B, C, D.

17 17 Phase B: reconstruct the split by the 4p condition. Estimate distances between the sequences at A, B, C, D, then apply the 4p condition. Was the correct split found? Repeat this process 10,000 times and count the number of failures.
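The two phases can be combined into a small simulation. The sketch below is not the setup used in the talk: the quartet shape (four external edges of equal length `t_ext` around one internal edge `t_int`), the function names, and the default number of trials are assumptions, and the substitution probabilities are derived from the K2P eigenvalues λ = e^{-4βt} and μ = e^{-2(α+β)t}.

```python
import math
import random

PURINES = "AG"
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def evolve(seq, alpha, beta, t, rng):
    """Phase A: evolve a sequence along one edge under K2P."""
    lam = math.exp(-4 * beta * t)
    mu = math.exp(-2 * (alpha + beta) * t)
    p_ti = (1 + lam - 2 * mu) / 4      # probability of a transition
    p_tv = (1 - lam) / 2               # probability of any transversion
    out = []
    for x in seq:
        r = rng.random()
        if r < p_ti:
            out.append(TRANSITION[x])
        elif r < p_ti + p_tv:
            out.append(rng.choice(TRANSVERSIONS[x]))
        else:
            out.append(x)
    return "".join(out)


def k2p_distance(u, v):
    """Corrected K2P total-rate distance; inf when the sample is saturated."""
    n = len(u)
    ti = sum(1 for x, y in zip(u, v)
             if x != y and (x in PURINES) == (y in PURINES))
    tv = sum(1 for x, y in zip(u, v) if (x in PURINES) != (y in PURINES))
    lam = 1 - 2 * tv / n
    mu = 1 - 2 * ti / n - tv / n
    if lam <= 0 or mu <= 0:
        return math.inf
    return -(math.log(lam) + 2 * math.log(mu)) / 4


def split_failure_rate(alpha, beta, t_ext, t_int, length, trials, seed=0):
    """Phase B, repeated: fraction of trials in which AB|CD is not recovered."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        left = "".join(rng.choice("ACGT") for _ in range(length))
        right = evolve(left, alpha, beta, t_int, rng)
        a, b = (evolve(left, alpha, beta, t_ext, rng) for _ in range(2))
        c, d = (evolve(right, alpha, beta, t_ext, rng) for _ in range(2))
        true_sum = k2p_distance(a, b) + k2p_distance(c, d)
        other = min(k2p_distance(a, c) + k2p_distance(b, d),
                    k2p_distance(a, d) + k2p_distance(b, c))
        if not true_sum < other:
            failures += 1
    return failures / trials


print(split_failure_rate(0.2, 0.1, t_ext=0.1, t_int=0.1, length=300, trials=20))
```

Sweeping `t_ext` and `t_int` together while keeping their ratio fixed would mimic the "various diameters" experiment; many more trials than the 20 used here are needed for stable percentages.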

18 18 The split resolution test was applied to the model quartet with various diameters. For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide).

19 19 Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2 Template quartet

20 20 Performance for larger diameters: site saturation

21 21 When β < α, we can postpone the site saturation effect. For this, use another distance function for the same model, Δ tv, which counts only transversions (transitions have rate α, transversions have rate β). This is actually the CFN model [Cavender78, Farris73, Neyman71].

22 22 Apply the same split resolution test to the transversions only distance: the aligned sequences (u: AACA…GTCTTCGAGGCCC, v: AGCA…GCCTATGCGACCT) are fed to the transversions only distance correction procedure.

23 23 Transversions only performs better on large rates and worse on small rates. (Plot curves: transversions only; total K2P rate.)

24 . Conclusion: Distance based reconstruction methods should be adaptive: find a distance function d which is good for the input. We take a small step in this direction. Input: an alignment of the sequences at u, v. Output: a (near-)optimal distance function, which minimizes the expected noise in the estimation procedure.

25 25 Example: An adaptive distance method (max-optimal) based on this talk:

26 26 Road Map: Distance based reconstruction algorithms; The Kimura 2 Parameter (K2P) Model; Performance of distance methods in the K2P model; Substitution models and Substitution Rate functions; Properties of SR functions; Unified Substitution Models; Optimizing Distances in the K2P model; Simulation results

27 27 Steps in finding optimal distance functions: 1. Define the substitution model. 2. Characterize the available distance functions. 3. Select a function which is optimal for the input sequences, i.e. least sensitive to stochastic noise.

28 28 From Rate matrices to Substitution matrices. Rate matrices imply stochastic substitution matrices: a stochastic substitution matrix $P_{uv}$ describes the evolution of a finite sequence (u: AACA…GTCTTCGAGGCCC, v: AGCA…GCCTATGCGACCT) by unknown model parameters α, β.

29 29 A substitution model $M$: a set of stochastic substitution matrices, closed under matrix product: $P, Q \in M \Rightarrow PQ \in M$. Motivation for the definition: a path u, v, w. Also required: $P > 0$ and $0 < \det(P) < 1$ for all $P \in M$.

30 30 Model tree over M. (Figure labels: uniform distribution; vertices r, v; substitution matrix $P_{rv}$.)

31 31 Distances for a given model are defined by Substitution Rate functions: $\Delta : M \to \mathbb{R}$ is an SR function for $M$ iff for all $P, Q$ in $M$: 1. $\Delta(PQ) = \Delta(P) + \Delta(Q)$ (additivity) 2. $\Delta(P) > 0$ (positivity)

32 32 Road Map: Distance based reconstruction algorithms; The Kimura 2 Parameter (K2P) Model; Performance of distance methods in the K2P model; Substitution models and substitution rate functions; Properties of SR functions; Unified Substitution Models; Optimizing Distances in the K2P model; Simulation results

33 33 1st question: Given a model M, what are its SR functions? SR functions are additive functions which are strictly positive.

34 34 Example 1: The logdet function [Lake94, Steel93] is an SR function for the most general model, $M_{univ}$: $M_{univ} = \{P : P \text{ is a stochastic } 4 \times 4 \text{ matrix},\ 0 < \det(P) < 1\}$.
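The additivity of logdet follows from det(PQ) = det(P)·det(Q). The sketch below checks this numerically, taking the logdet function to be -ln det(P); that exact form (without any normalizing constant) is an assumption here, as are the helper names and the two example matrices.

```python
import math


def matmul(p, q):
    """Product of two 4x4 matrices given as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result           # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def logdet_distance(p):
    """Assumed form of the logdet SR function: -ln det(P)."""
    return -math.log(det(p))


# Two hypothetical stochastic matrices (rows sum to 1, 0 < det < 1).
P = [[0.70, 0.10, 0.10, 0.10],
     [0.05, 0.80, 0.05, 0.10],
     [0.10, 0.10, 0.75, 0.05],
     [0.02, 0.08, 0.10, 0.80]]
Q = [[0.90, 0.04, 0.03, 0.03],
     [0.10, 0.70, 0.10, 0.10],
     [0.05, 0.05, 0.85, 0.05],
     [0.10, 0.10, 0.10, 0.70]]

print(logdet_distance(matmul(P, Q)), logdet_distance(P) + logdet_distance(Q))
```

The two printed values agree up to rounding, and both summands are positive because each determinant lies strictly between 0 and 1.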

35 35 Example 2: The log eigenvalue function

36 36 Both logdet and the log eigenvalue functions are special cases of a general technique: Generalized logdet which is given below:

37 37 Linearity of additive functions: if $\Delta_1$ and $\Delta_2$ are additive functions for $M$, so is $c_1\Delta_1 + c_2\Delta_2$. The set of additive functions for $M$ forms a vector space, denoted $AD_M$; dimension($AD_M$) is the dimension of this vector space. A large dimension implies more independent distance functions. If dimension($AD_M$) = 1, then $M$ admits a single distance function (up to multiplication by a scalar), and selecting the best SR function in such a model is trivial. Thus, the adaptive approach is useful only when dimension($AD_M$) > 1.

38 38 Road Map: Distance based reconstruction algorithms; The Kimura 2 Parameter (K2P) Model; Performance of distance methods in the K2P model; Substitution models and substitution rate functions; Properties of SR functions; Unified Substitution Models: models for which the adaptive approach is potentially useful; Optimizing Distances in the K2P model; Simulation results

39 39 Unified Substitution Models. Def: A model $M$ is unified if there is a matrix $U$ s.t. for each $P \in M$ it holds that: $U^{-1} P U = $ … Using Lemma GLD, we have:

40 40 Strongly Unified Substitution Models. Def: A model $M$ is strongly unified if there is a matrix $U$ s.t. for each $P \in M$ it holds that: $U^{-1} P U = $ …

41 41 A simple strongly unified model: the Jukes Cantor model [1969]. $M_{JC} = \{\,\ldots : 0 < p < 0.25\,\}$. $M_{JC}$ is strongly unified by $U = $ …: for all $P \in M_{JC}$, $U^{-1} P U = $ … Claim: dimension($AD_{M_{JC}}$) = 1. Hence the adaptive approach is irrelevant to this model.

42 42 Another model M for which dimension(AD M )=1 Recall: M univ consists of all DNA transition matrices. Claim 2: dimension(AD M univ ) = 1 This means that all the additive functions of M univ are proportional to logdet. Hence the adaptive approach is irrelevant also to this model. Luckily, the additive functions of intermediate unified models have dimensions > 1, hence the adaptive approach is useful for them. Next we return to the Kimura 2 parameter model.

43 43 Back to K2P: For every K2P substitution matrix $P$: $U^{-1} P U = \mathrm{diag}(1, \lambda_P, \mu_P, \mu_P)$, where $\lambda_P = 1 - 4P_\beta = e^{-4\beta}$ and $\mu_P = 1 - 2P_\beta - 2P_\alpha = e^{-2\alpha - 2\beta}$, with $0 < \lambda_P < 1$ and $0 < \mu_P < 1$. Here $U$ is the $U$ of the JC model. Conclusion: dimension($AD_{M_{K2P}}$) = 2.
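The eigenvalue structure can be verified directly on a K2P substitution matrix. In the sketch below, the nucleotide order A, G, C, T, the function names, and the particular eigenvectors are assumptions chosen for illustration (the matrix U from the slide is not reproduced here; these vectors are just one basis of the same eigenspaces), and elapsed time is taken to be absorbed into α and β.

```python
import math

ORDER = "AGCT"
PURINES = "AG"


def k2p_substitution_matrix(alpha, beta):
    """Return (P, lambda_P, mu_P) for rates alpha, beta with time absorbed."""
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    p_beta = (1 - lam) / 4               # from lambda_P = 1 - 4 P_beta
    p_alpha = (1 + lam - 2 * mu) / 4     # from mu_P = 1 - 2 P_beta - 2 P_alpha
    stay = 1 - p_alpha - 2 * p_beta
    matrix = []
    for x in ORDER:
        row = []
        for y in ORDER:
            if x == y:
                row.append(stay)
            elif (x in PURINES) == (y in PURINES):
                row.append(p_alpha)      # transition
            else:
                row.append(p_beta)       # transversion
        matrix.append(row)
    return matrix, lam, mu


def is_eigenvector(matrix, vector, value, tol=1e-12):
    """Check that matrix * vector equals value * vector entrywise."""
    image = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
    return all(abs(i - value * v) < tol for i, v in zip(image, vector))


P, lam, mu = k2p_substitution_matrix(0.3, 0.1)
print(is_eigenvector(P, [1, 1, 1, 1], 1.0),      # eigenvalue 1
      is_eigenvector(P, [1, 1, -1, -1], lam),    # purines vs pyrimidines
      is_eigenvector(P, [1, -1, 0, 0], mu),      # within purines
      is_eigenvector(P, [0, 0, 1, -1], mu))      # within pyrimidines
```

All four checks hold, which matches the diagonal form with one eigenvalue 1, one eigenvalue λ_P and a double eigenvalue μ_P.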

44 44 The functions $\Delta_\lambda(P) = -\ln(\lambda_P)$ and $\Delta_\mu(P) = -\ln(\mu_P)$ form a basis of $AD_{K2P}$. The standard total rate distance is $\Delta_{K2P}(P) = -(\ln(\lambda_P) + 2\ln(\mu_P))/4 = -\Delta_{logdet}(P)/4$. The transversion only distance is $\Delta_{tr}(P) = -\ln(\lambda_P)/4$.
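Both known distances are therefore linear combinations c_1·Δ_λ + c_2·Δ_μ of the two basis functions. A short sketch with hypothetical function names and example rates, using λ_P = e^{-4β} and μ_P = e^{-2α-2β}:

```python
import math


def delta_lambda(lam):
    return -math.log(lam)


def delta_mu(mu):
    return -math.log(mu)


def combined(lam, mu, c1, c2):
    """General additive function c1 * Delta_lambda + c2 * Delta_mu."""
    return c1 * delta_lambda(lam) + c2 * delta_mu(mu)


def total_rate(lam, mu):
    return combined(lam, mu, 0.25, 0.5)      # -(ln(lam) + 2 ln(mu)) / 4


def transversions_only(lam, mu):
    return combined(lam, mu, 0.25, 0.0)      # -ln(lam) / 4


alpha, beta = 0.3, 0.1                       # hypothetical rates
lam = math.exp(-4 * beta)
mu = math.exp(-2 * alpha - 2 * beta)
print(total_rate(lam, mu), transversions_only(lam, mu))
```

With exact eigenvalues the first value equals α + 2β and the second equals β, up to rounding. The adaptive approach amounts to choosing the pair (c_1, c_2) according to the data instead of fixing it in advance.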

45 . The Adaptive distance based algorithm for the K2P model ACCGTTG AGCCGTG

46 46 Road Map: Distance based reconstruction algorithms; The Kimura 2 Parameter (K2P) Model; Performance of distance methods in the K2P model; Substitution models and substitution rate functions; Properties of SR functions; Unified Substitution Models; Optimizing Distances in the K2P model; Simulation results

47 47 K2P distance estimation: where the noise comes from. Aligned sequences: u: AACA…GTCTTCGAGGCCC, v: AGCA…GCCTATGCGACCT. Stages: inherent noise; implied noise propagation; user controlled noise propagation.

48 48 Selection of $c_1, c_2$: Estimated distance = True distance + Expected error

49 49 Expected Relative Error = Expected error / True distance

50 50 Minimizing the expected relative error

51 51 A basic property of Normalized Mean Square Error: … This means that equivalent SR functions have the same NMSE.
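This invariance can be observed in a small Monte Carlo experiment. The sketch below rests on several assumptions: NMSE is taken to be E[(estimate - truth)^2] / truth^2, "equivalent" is read as "equal up to a positive scalar factor", sites are sampled independently, and saturated samples (where a logarithm is undefined) are skipped. The function name `nmse` and all parameter values are hypothetical.

```python
import math
import random


def nmse(alpha, beta, c1, c2, n_sites, trials, seed=0):
    """Monte Carlo estimate of E[(estimate - truth)^2] / truth^2."""
    rng = random.Random(seed)
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    p_ti = (1 + lam - 2 * mu) / 4    # probability a site shows a transition
    p_tv = (1 - lam) / 2             # probability a site shows a transversion
    truth = -c1 * math.log(lam) - c2 * math.log(mu)
    total, used = 0.0, 0
    for _ in range(trials):
        ti = tv = 0
        for _ in range(n_sites):
            r = rng.random()
            if r < p_ti:
                ti += 1
            elif r < p_ti + p_tv:
                tv += 1
        lam_hat = 1 - 2 * tv / n_sites
        mu_hat = 1 - 2 * ti / n_sites - tv / n_sites
        if lam_hat <= 0 or mu_hat <= 0:
            continue                 # saturated sample: skip it
        estimate = -c1 * math.log(lam_hat) - c2 * math.log(mu_hat)
        total += ((estimate - truth) / truth) ** 2
        used += 1
    return total / used if used else math.inf


# The same function up to a factor of 4, evaluated on the same random samples.
base = nmse(0.3, 0.1, c1=1.0, c2=2.0, n_sites=200, trials=200)
scaled = nmse(0.3, 0.1, c1=0.25, c2=0.5, n_sites=200, trials=200)
print(base, scaled)
```

Because scaling (c_1, c_2) scales the estimate and the true value alike, only the ratio between the two coefficients matters when minimizing the relative error.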

52 52 A Proper Disclosure on our optimal functions:

53 53 Relation between c and SR functions (function name: function; c; c/(1+c)). Total rate (logdet): $-\ln(\lambda_P) - 2\ln(\mu_P)$; c = 1/2; c/(1+c) = 1/3. Transversions only: $-\ln(\lambda_P)$; c = ∞; c/(1+c) = 1.

54 54 Optimal values of $c_{opt}/(1+c_{opt})$ for ti/tv ratio = 10 (α=20β). As the rate grows, the relative weight of the transversion coefficient increases.

55 55 Optimal values of $c_1/(c_1+c_2)$ for various transition/transversion rates. (Plot panels: α=β; α=2β; α=4β; α=20β; α=200β; α>>β, rate>2.)

56 56 Expected relative error of various distance functions: theoretical prediction. (Plot curves: total rate; transversions only; optimal.)

57 57 Road Map: Distance based reconstruction algorithms; The Kimura 2 Parameter (K2P) Model; Performance of distance methods in the K2P model; Substitution models and substitution rate functions; Properties of SR functions; Unified Substitution Models; Optimizing Distances in the K2P model; Simulation results

58 58 Expected relative error of various distance functions: simulations. (Plot curves: total rate; transversions only; optimal. Plot annotation: small eigenvalue distortion.)

59 59 Back to the K2P quartet resolution. A heuristic distance method (max-optimal) based on this talk: select a distance function which is optimal w.r.t. the largest of the six observed distances of the quartet (i.e., largest $c_{opt}$). Recall the performance of the two known distance functions on the template quartet.

60 60 When α β, the suggested heuristic performs better than both known methods.

61 61 Summary: Adaptive approach to distance based reconstruction: adjust the distance function to the input sequences. Distance functions for stochastic evolutionary models are defined by SR functions. SR functions can be constructed by Generalized Logdet. When the dimension of the space of SR functions is greater than 1, the adaptive approach is applicable. The adaptive approach is applicable to non-trivial unified models. Most common models are unified. An analysis of the simplest non-trivial unified model, K2P, shows a significant improvement in accuracy from the adaptive approach.

62 62 Further Research: - Prove/Disprove: for any substitution model M, all the additive functions of M are GLD functions. - In the K2P model: define and find optimal SR functions for two distances, quartets, general trees; find optimal SR functions for non-homogeneous model trees; find optimal SR functions for variable rates across sites. - Find optimal SR functions for more general evolutionary models (Tamura Nei) (analytic/heuristic methods). - Empirical/analytical study of plugging adaptive distances into common reconstruction algorithms (e.g. NJ). - Study improvement in performance on real biological data. - Devise algorithms which use distance-vectors.

65 65 Further research questions We have infinitely many additive distance functions for the K2P model. Which one should we use for reconstructing the tree? If we have the exact substitution matrices for all pairs of taxa, then all functions are equally good. But we have only finite sequences, whose alignments provide only estimations of the true substitution matrices

66 66 Distances are defined by Substitution Rate functions. For each tree path u, v, w it holds that D(u,v) + D(v,w) = D(u,w).

67 67 Part 3.1: from Substitution models to Additive distances

68 68 From aligned sequences to joint distribution matrices. The aligned sequences provide, for each pair of DNA letters, say A and G, how many times A was mutated to G. This defines a joint distribution matrix F. Example: A is aligned with G in 5% of the pairs. F (rows and columns ordered A, G, T, C): A: 0.2, 0.05, 0.01, 0.02; G: …, 0.25, 0.01, …; T: 0.02, 0.01, 0.16, 0.02; C: 0.01, …, …, 0.2
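Computing F from an alignment is a matter of counting aligned letter pairs and normalizing. A minimal sketch, with a hypothetical function name and toy sequences; the A, G, T, C ordering follows the slide:

```python
ORDER = "AGTC"  # row/column order used on the slide


def joint_distribution(seq_u, seq_v):
    """F[i][j] = fraction of sites where letter i in u is aligned with j in v."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    index = {letter: i for i, letter in enumerate(ORDER)}
    counts = [[0] * 4 for _ in range(4)]
    for x, y in zip(seq_u, seq_v):
        counts[index[x]][index[y]] += 1
    n = len(seq_u)
    return [[c / n for c in row] for row in counts]


# Toy alignment (hypothetical): the second site pairs A with G.
F = joint_distribution("AAGT", "AGGT")
for letter, row in zip(ORDER, F):
    print(letter, row)
```

The entries of F always sum to 1, and with short sequences each entry is only a coarse estimate of the underlying probability.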

69 69 Joint distribution matrices are converted to distances by substitution models. These models describe how DNA sequences are transformed during evolution. The tool used for this is the theory of Markov processes, which we sketch in the following. Additional reading is recommended…

70 70 Different biological models impose restrictions on the substitution matrices. Our model is the Kimura 2 Parameter (K2P) model. K2P distinguishes between two mutation types: transitions {A↔G, C↔T} and transversions [{A,G}↔{C,T}]. Aligned sequences of two species: u: AACA…GTCTTCGAGGCCC, v: AGCA…GCCTATGCGACCT

71 71 K2P rate matrices have the following shape (rows and columns ordered A, G, T, C, diagonal entries marked "-"): all transitions have rate α; all transversions have rate β.

72 72 Part 3.2: Distance functions for K2P ( Linear Algebra in the service of Biology)

73 73 Let $P, Q$ be two matrices in K2P. Then: $U^{-1} P U = \mathrm{diag}(\mu_P, \mu_P, \lambda_P, 1)$, $U^{-1} Q U = \mathrm{diag}(\mu_Q, \mu_Q, \lambda_Q, 1)$, and therefore $U^{-1} PQ\, U = \mathrm{diag}(\mu_P\mu_Q, \mu_P\mu_Q, \lambda_P\lambda_Q, 1)$.

74 74 $U^{-1} PQ\, U$ = a diagonal matrix with diagonal entries $\lambda_1(P)$, $\lambda_2(P)$, $\lambda_3(P)$ and 1.

75 75 $U^{-1} P U = \mathrm{diag}(\lambda_P, \lambda_P, \lambda_P, 1)$

76 76 The joint distribution of each pair of vertices (e.g. u, v, w with sequences ACGGTCA, ACGGATA, GGGGATT) provides an approximation of the substitution matrices. The common theme of all projects: start with input sequences for two or more taxa; find a distance function which minimizes the inaccuracy (noise) introduced by the sampling process.

77 77 Instantaneous Rate Matrix $R_{uv}$ for an edge (u, v), rows and columns ordered A, G, T, C: A: -2.5, 1.5, 0.5, …; G: 1.5, -2.6, 0.6, 0.5; T: …, …, -3, 2; C: 0.5, …, 2, -3. Example: A is substituted by G at a rate of 1.5 times per million years (say).

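78 78 A rate matrix $R_{uv}$ together with an elapsed time t implies a stochastic substitution matrix $P_{uv} = e^{tR}$ for the edge (u, v), rows and columns ordered A, G, T, C: A: 0.7, 0.2, 0.05, …; G: 0.2, 0.7, 0.05, …; T: 0.1, …, 0.6, 0.2; C: 0.1, …, 0.2, 0.6. Example: 20% of the As will be substituted by G.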
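The relation P = e^{tR} can be illustrated with a truncated Taylor series for the matrix exponential. The sketch below uses a K2P rate matrix with hypothetical rates (not the matrix from the slide) and the assumed nucleotide order A, G, C, T, and compares the result with the K2P closed form derived from λ = e^{-4βt} and μ = e^{-2(α+β)t}. The helper names are assumptions, and a plain Taylor series is only adequate when the entries of tR are small.

```python
import math


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(rate, t, terms=40):
    """Approximate e^{tR} by summing the first terms of its Taylor series."""
    n = len(rate)
    scaled = [[t * x for x in row] for row in rate]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes (tR)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + x for r, x in zip(res_row, term_row)]
                  for res_row, term_row in zip(result, term)]
    return result


alpha, beta, t = 0.3, 0.1, 1.0          # hypothetical K2P rates and time
diag = -(alpha + 2 * beta)
R = [[diag, alpha, beta, beta],
     [alpha, diag, beta, beta],
     [beta, beta, diag, alpha],
     [beta, beta, alpha, diag]]
P = mat_exp(R, t)

lam = math.exp(-4 * beta * t)
mu = math.exp(-2 * (alpha + beta) * t)
print(P[0][1], (1 + lam - 2 * mu) / 4)  # transition probability A -> G
print(P[0][2], (1 - lam) / 4)           # transversion probability A -> C
```

Each printed pair agrees up to rounding, and every row of P sums to 1, as required of a stochastic substitution matrix.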

79 79 K2P rate matrix (rows and columns ordered A, G, C, T): A: -, α, β, β; G: α, -, β, β; C: β, β, -, α; T: β, β, α, -

80 80 K2P rate matrix with rates α', β' (rows and columns ordered A, G, C, T): A: -, α', β', β'; G: α', -, β', β'; C: β', β', -, α'; T: β', β', α', -

81 81 25% ACGGATA K2P Model tree: ====== + + r v R uv

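83 83 Jukes Cantor substitution matrix (rows and columns ordered A, G, T, C): A: 1-3p, p, p, p; G: p, 1-3p, p, p; T: p, p, 1-3p, p; C: p, p, p, 1-3p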

85 85 K2P Model tree: ====== + + 0.25 AGCT

86 86 K2P rate matrices have the following shape (rows and columns ordered A, G, T, C, diagonal entries marked "-"): all transitions have rate α; all transversions have rate β.

87 87 Given sequences at two adjacent vertices u, v (e.g. …TCTGGGA… and …GGGGATT…) we define the edge length in two steps. First, align the sequences: u: AACA…GTCTTCGAGGCCC, v: AGCA…GCCTATGCGACCT (columns $C_1, C_2, C_3, C_4, \ldots, C_m$),

88 88 Natural evolutionary distance: Total substitution rate u v w Each edge is associated with a time t and a K2P rate matrix S. The total substitution rate along an edge of length t is t(α +2β). Total substitution rate between species = sum of the rates over the path connecting them. Total substitution rates are exact distances, which we try to reconstruct from observing the joint distribution of sequences at u and v.

89 89 How do we estimate $D_{K2P}(u,v)$? Our input is the aligned sequences at u and v (u: AACA…GTCTTCGAGGCCC, v: AGCA…GCCTATGCGACCT). They can be used to estimate the probability that a nucleotide X in u will be replaced by a nucleotide Y in v.

90 90 First step in distance estimation: estimate $P_{uv}$ from the joint distributions (Maximum Likelihood). Aligned sequences: u: AACA…GTCTTCGAGGCCC, v: AGCA…GCCTATGCGACCT

92 92 The substitution matrix is estimated from the observed differences between the sequences (e.g. ACCGTTG vs. TCTGGGA). Errors in distance estimation are amplified when: - The rate is small: the signal is too weak (in extreme cases, there are no substitutions whatsoever). - The rate is large: recent substitutions overwrite older ones.

93 93 25% ACGGATA K2P Model tree: ====== + + r v R uv

94 94 How reliable Consider balanced quartets. Define the quartet ratio to be the ratio between the middle edge and two external edges.

95 95 The rate matrix S implies a stochastic substitution matrix P uv : u v P uv defines the joint distribution of the sequences at u,v.

96 96 What happens when α = β? (Jukes Cantor) transversion only is just a noisier version of the standard distance

97 97 Performance of the standard distance method in reconstructing the split from estimated distances. Distance based 4-point method (FPM): reconstruction will fail if … (Figure labels: diam, $w_{sep}$; the three quartet topologies over A, B, C, D.)

99 99 Minimizing the expected relative error

100 . Distance based methods: The general scheme. - Compute distances between all taxon-pairs. - Find a tree (edge-weighted) best-describing the distances. (Figure label: This Talk.)

101 101 Phylogenetic Reconstruction. Input sequences: AATCCTG ATAGCTG AATGGGC GAACGTA AAACCGA ACCGTTG TCTGGGA TCCGGAA AGCCGTG GGGGATT

102 . Adaptive distance based algorithm for the K2P model. (Figure: distance matrix D.)

103 . Distance based methods: The general scheme. - Compute distances between all taxon-pairs. - Find a tree (edge-weighted) best-describing the distances. (Figure label: This Talk.)

104 . Distance based methods: An adaptive scheme. - Find a good distance function: find a distance function d which is good for the input (this work). - Compute distances between all taxon-pairs. - Find a tree (edge-weighted) best-describing the distances.

105 . Promotion: Make Distance based methods adaptive

106 106 Summary of previous slides:

107 107 Minimizing the expected relative error This means that equivalent SR functions have the same NMSE

108 108 The known SR functions for $M_{K2P}$ are: … Each SR function for $M_{K2P}$ is a linear combination of these functions. When α>β, the optimal SR function lies between these two functions: for small distances we use the total rate, for large distances we use transversions only. This is depicted in the following plots.

109 109 Values of $c_1/(c_1+c_2)$ for total rate (logdet) and transversions only (λ).

