1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel.

Presentation transcript:

1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel

2 Preview The Phylogenetic Reconstruction Problem

3 AATCCTG ATAGCTG AATGGGC GAACGTA AAACCGA ACGGTCA ACGGATA ACGGGTA ACCCGTG ACCGTTG TCTGGTA TCTGGGA TCCGGAAAGCCGTG GGGGATT AAAGTCA AAAGGCG AAACACA AAAGCTG Evolution is modeled by a Tree (All our sequences are DNA sequences, consisting of {A,G,C,T})

4 AATCCTG ATAGCTG AATGGGC GAACGTA AAACCGA ACCGTTG TCTGGGA TCCGGAAAGCCGTG GGGGATT Phylogenetic Reconstruction

5 B : AATCCTG C : ATAGCTG A : AATGGGC D : GAACGTA E : AAACCGA J : ACCGTTG G : TCTGGGA H : TCCGGAA I : AGCCGTG F : GGGGATT Goal: reconstruct the true tree as accurately as possible reconstruct A B C F G IHJ D E A B C F G I H J D E (root) Phylogenetic Reconstruction

6 What Makes Reconstruction Tough? Misspecified model of evolution. Short and deep edges vs. limited sequence length: weak signals (short edges) and decaying signals (deep edges). Example sequences: ACGGATA ACGGGTA ACCGATG. Assume a stochastic model (e.g. Kimura 2 Parameter).

7 Road Map Distance based reconstruction algorithms The Kimura 2 Parameter (K2P) Model Performance of distance methods in the K2P model Substitution models and substitution rate functions Properties of SR functions Unified Substitutions Models Optimizing Distances in the K2P model Simulation results

8 Distance Based Phylogenetic Reconstruction: Exact vs. Noisy distances. The edge-weighted true tree defines exact (additive) distances between species. Distance estimation using finite sampling yields estimated distances, from which the reconstructed tree is built. Challenge: minimize the effect of the noise introduced by the sampling.

9 Road Map Distance based reconstruction algorithms The Kimura 2 Parameter (K2P) Model Performance of known distance methods in the K2P model Substitution models and substitution rate functions Properties of SR functions Unified Substitutions Models Optimizing Distances in the K2P model Simulation results

10 The Kimura 2 Parameter (K2P) model [Kimura80]: each edge corresponds to a Rate Matrix Transitions Transversions Transitions K2P generic rate matrix u v

11 K2P standard distance: Δ_total = total substitution rate. The total substitution rate of a K2P rate matrix R is Δ_total(R) = α + 2β. This is the expected number of mutations per site. It is an additive distance: along a path u, v, w with rates (α, β) and (α′, β′), the totals add up to (α+α′) + 2(β+β′).

12 Estimation of Δ_total(R_uv) = d_K2P(u,v) is a noisy stochastic process. u AACA…GTCTTCGAGGCCC v AGCA…GCCTATGCGACCT K2P total rate distance correction procedure
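
The correction procedure named on this slide can be sketched as follows. This is a minimal sketch, not the authors' code: it assumes gap-free, equal-length, uppercase sequences and uses Kimura's classical formula d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the observed fractions of transition and transversion sites. The function name `k2p_distance` is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_u, seq_v):
    """Estimate the K2P total substitution rate between two aligned sequences."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for x, y in zip(seq_u, seq_v):
        if x == y:
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    p = transitions / len(seq_u)
    q = transversions / len(seq_u)
    a = 1 - 2 * p - q
    b = 1 - 2 * q
    if a <= 0 or b <= 0:
        return math.inf           # saturated: the correction is undefined
    return -0.5 * math.log(a) - 0.25 * math.log(b)


print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

Returning infinity on saturation is a design choice of this sketch; it makes the failure mode visible instead of raising inside a larger simulation.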

13 Road Map Distance based reconstruction algorithms The Kimura 2 Parameter (K2P) Model Performance of distance methods in the K2P model Substitution models and substitution rate functions Properties of SR functions Unified Substitutions Models Optimizing Distances in the K2P model Simulation results

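14 Check performance of K2P standard distances in resolving quartet splits. There are 3 possible quartet topologies: AB|CD, AC|BD, AD|BC. Distance methods reconstruct the true split by the 4-point condition; w_sep denotes the weight of the separating edge. The 4-point condition for noisy distances is: choose AB|CD when d(A,B) + d(C,D) < min{ d(A,C) + d(B,D), d(A,D) + d(B,C) }.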
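
The 4-point condition on this slide can be sketched in a few lines. This is an illustrative sketch only: the function name `resolve_quartet`, the dictionary layout, and the numeric distances are assumptions made for the example, not part of the talk.

```python
def resolve_quartet(d):
    """Return the split chosen by the four-point condition.

    d maps frozenset({x, y}) to the estimated distance between taxa x and y,
    for x, y in {"A", "B", "C", "D"}.
    """
    def dist(x, y):
        return d[frozenset((x, y))]

    sums = {
        "AB|CD": dist("A", "B") + dist("C", "D"),
        "AC|BD": dist("A", "C") + dist("B", "D"),
        "AD|BC": dist("A", "D") + dist("B", "C"),
    }
    # The split whose pairwise sum is smallest is taken as the reconstruction.
    return min(sums, key=sums.get)


# Additive distances for a tree with split AB|CD: unit external edges and a
# separating edge of weight 0.5 (illustrative values).
exact = {
    frozenset("AB"): 2.0, frozenset("CD"): 2.0,
    frozenset("AC"): 2.5, frozenset("AD"): 2.5,
    frozenset("BC"): 2.5, frozenset("BD"): 2.5,
}
print(resolve_quartet(exact))
```

With exact additive distances the two wrong sums exceed the right one by twice the separating edge weight, so noise must exceed that margin to flip the decision.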

15 We evaluate the accuracy of the K2P distance estimation by Split Resolution Test: root D C A B t is evolutionary time The diameter of the quartet is 22t

16 Phase A: simulate evolution D C A B

17 Phase B: reconstruct the split by the 4p condition. Estimate distances between the sequences at A, B, C, D, then apply the 4p condition. Was the correct split found? Repeat this process 10,000 times and count the number of failures.

18 The split resolution test was applied to the model quartet with various diameters. For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide).
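
The two phases of the split resolution test can be sketched as a small simulation. This is a sketch under stated assumptions, not the authors' experiment: the quartet shape (four equal external edges, root at the middle of the separating edge), the parameter values, the number of trials, and all function names are hypothetical. The substitution probabilities follow from the K2P eigenvalues e^{-4β} and e^{-2α-2β}.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def evolve(seq, alpha, beta, rng):
    """Evolve seq along one edge; alpha and beta are rate-times-time products."""
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    p_ti = (1 + lam - 2 * mu) / 4   # probability of the single transition
    p_tv = (1 - lam) / 2            # probability of either transversion
    out = []
    for x in seq:
        r = rng.random()
        if r < p_ti:
            out.append(TRANSITION[x])
        elif r < p_ti + p_tv:
            out.append(rng.choice(TRANSVERSIONS[x]))
        else:
            out.append(x)
    return "".join(out)


def k2p_distance(s, t):
    """Total-rate K2P distance estimated from two aligned sequences."""
    n = len(s)
    ti = sum(1 for x, y in zip(s, t) if x != y and TRANSITION[x] == y)
    tv = sum(1 for x, y in zip(s, t) if x != y and TRANSITION[x] != y)
    a, b = 1 - 2 * ti / n - tv / n, 1 - 2 * tv / n
    if a <= 0 or b <= 0:
        return math.inf
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def count_failures(alpha, beta, n_sites, trials, seed=0):
    """Count trials in which the 4p condition misses the true split AB|CD."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        root = "".join(rng.choice("ACGT") for _ in range(n_sites))
        # Phase A: the root sits midway on the separating edge.
        left = evolve(root, alpha / 2, beta / 2, rng)
        right = evolve(root, alpha / 2, beta / 2, rng)
        s = {
            "A": evolve(left, alpha, beta, rng),
            "B": evolve(left, alpha, beta, rng),
            "C": evolve(right, alpha, beta, rng),
            "D": evolve(right, alpha, beta, rng),
        }

        # Phase B: estimate the six distances and apply the 4p condition.
        def d(x, y):
            return k2p_distance(s[x], s[y])

        true_sum = d("A", "B") + d("C", "D")
        others = min(d("A", "C") + d("B", "D"), d("A", "D") + d("B", "C"))
        if not true_sum < others:   # ties and saturation count as failures
            failures += 1
    return failures


print(count_failures(alpha=0.1, beta=0.05, n_sites=300, trials=50))
```

Scaling `alpha` and `beta` together while keeping their ratio fixed changes the quartet diameter, which is the quantity varied in the test described here.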

19 Performance of K2P distances in resolving quartets, small diameters: Template quartet

20 Performance for larger diameters site saturation

21 When β < α, we can postpone the site saturation effect. For this, use another distance function for the same model, Δ_tv, which counts only transversions (transitions have rate α, transversions have rate β). This is actually the CFN model [Cavender78, Farris73, Neyman71].

22 Apply the same split resolution test on the transversions only distance: u AACA…GTCTTCGAGGCCC v AGCA…GCCTATGCGACCT Transversions only Distance correction procedure

23 Transversions only performs better on large rates, worse on small rates. (Plot legend: transversions only; total K2P rate.)

Conclusion: Distance based reconstruction methods should be adaptive: find a distance function d which is good for the input. We take a small step in this direction. Input: an alignment of the sequences at u, v. Output: a (near-)optimal distance function, which minimizes the expected noise in the estimation procedure.

25 Example: An adaptive distance method (max-optimal) based on this talk:

26 Road Map Distance based reconstruction algorithms The Kimura 2 Parameter (K2P) Model Performance of distance methods in the K2P model Substitution models and Substitution Rate functions Properties of SR functions Unified Substitutions Models Optimizing Distances in the K2P model Simulation results

27 Steps in finding optimal distance functions: 1. Define a substitution model. 2. Characterize the available distance functions. 3. Select a function which is optimal for the input sequences, i.e. least sensitive to stochastic noise.

28 From Rate matrices to Substitution matrices AACA…GTCTTCGAGGCCC u v AGCA…GCCTATGCGACCT Rate matrices imply stochastic substitution matrices: Evolution of a finite sequence by unknown model parameters α, β A stochastic substitution matrix P uv

29 A substitution model M: a set of stochastic substitution matrices, closed under matrix product: P, Q ∈ M ⇒ PQ ∈ M. Motivation for the definition: substitutions along a path u, v, w compose by matrix product. Also required: P > 0 and 0 < det(P) < 1 for all P ∈ M.

30 Uniform distribution Model tree over M = + + r v P rv P..

31 Distances for a given model are defined by Substitution Rate (SR) functions. Δ : M → ℝ is an SR function for M iff for all P, Q in M: 1. Δ(PQ) = Δ(P) + Δ(Q) (additivity). 2. Δ(P) > 0 (positivity).

32 Road Map Distance based reconstruction algorithms The Kimura 2 Parameter (K2P) Model Performance of distance methods in the K2P model Substitution models and substitution rate functions Properties of SR functions Unified Substitutions Models Optimizing Distances in the K2P model Simulation results

33 1st question: Given a model M, what are its SR functions? SR functions are additive functions which are strictly positive.

34 Example 1: The logdet function [Lake94, Steel93] is an SR function for the most general model, M_univ: M_univ = {P : P is a stochastic 4×4 matrix, 0 < det(P) < 1}.

35 Example 2: The log eigenvalue function

36 Both logdet and the log eigenvalue functions are special cases of a general technique: Generalized logdet which is given below:

37 Linearity of additive functions: if Δ_1 and Δ_2 are additive functions for M, so is c_1 Δ_1 + c_2 Δ_2. The set of additive functions for M forms a vector space, denoted AD_M. Dimension(AD_M) is the dimension of this vector space. A large dimension implies more independent distance functions. If dimension(AD_M) = 1, then M admits a single distance function (up to multiplication by a scalar), and selecting the best SR function in such a model is trivial. Thus, the adaptive approach is useful only when dimension(AD_M) > 1.

38 Road Map Distance based reconstruction algorithms The Kimura 2 Parameter (K2P) Model Performance of distance methods in the K2P model Substitution models and substitution rate functions Properties of SR functions Unified Substitutions Models: Models for which the adaptive approach is potentially useful. Optimizing Distances in the K2P model Simulation results

39 Unified Substitution Models: U -1 PU = Def: A model M is unified if there is a matrix U s.t. for each P M it holds that: Using Lemma GLD, we have:

40 Strongly Unified Substitution Models U -1 PU = Def: A model M is strongly unified if there is a matrix U s.t. for each P M it holds that:

41 A simple strongly unified model: the Jukes Cantor model [1969]. M_JC = {P : P has 1−3p on the diagonal and p elsewhere, 0 < p < 0.25}. M_JC is strongly unified by a single matrix U: for all P ∈ M_JC, U^{-1} P U = diag(1, 1−4p, 1−4p, 1−4p). Claim: dimension(AD_{M_JC}) = 1. Hence the adaptive approach is irrelevant to this model.

42 Another model M for which dimension(AD M )=1 Recall: M univ consists of all DNA transition matrices. Claim 2: dimension(AD M univ ) = 1 This means that all the additive functions of M univ are proportional to logdet. Hence the adaptive approach is irrelevant also to this model. Luckily, the additive functions of intermediate unified models have dimensions > 1, hence the adaptive approach is useful for them. Next we return to the Kimura 2 parameter model.

43 Back to K2P: for every K2P substitution matrix P, U^{-1} P U = diag(1, λ_P, μ_P, μ_P), where λ_P = 1 − 4P_β = e^{−4β} and μ_P = 1 − 2P_β − 2P_α = e^{−2α−2β}, with 0 < λ_P < 1 and 0 < μ_P < 1. Here U is the U of the JC model. Conclusion: dimension(AD_{M_K2P}) = 2.
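
The eigenvalue structure on this slide can be checked numerically. The transcript does not preserve the matrix U, so the sketch assumes U is the 4×4 Hadamard matrix with nucleotides ordered A, G, C, T; that choice, and the names `k2p_matrix` and `eigenvalues_via_u`, are assumptions of this example.

```python
import math

ORDER = "AGCT"
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def k2p_matrix(alpha, beta):
    """K2P substitution matrix for rate-times-time products alpha, beta."""
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    p_beta = (1 - lam) / 4             # each specific transversion
    p_alpha = (1 + lam - 2 * mu) / 4   # the transition
    stay = 1 - p_alpha - 2 * p_beta
    return [[stay if x == y else p_alpha if TRANSITION[x] == y else p_beta
             for y in ORDER] for x in ORDER]


# Assumed columns of U (a Hadamard matrix, order A, G, C, T).
U_COLUMNS = [
    (1, 1, 1, 1),     # eigenvalue 1
    (1, 1, -1, -1),   # eigenvalue lambda_P: purines vs. pyrimidines
    (1, -1, 1, -1),   # eigenvalue mu_P
    (1, -1, -1, 1),   # eigenvalue mu_P
]


def eigenvalues_via_u(P):
    """Return the eigenvalue of P on each assumed column of U."""
    values = []
    for u in U_COLUMNS:
        Pu = [sum(P[i][j] * u[j] for j in range(4)) for i in range(4)]
        ratios = [Pu[i] / u[i] for i in range(4)]
        assert max(ratios) - min(ratios) < 1e-12, "not an eigenvector"
        values.append(ratios[0])
    return values


alpha, beta = 0.3, 0.1
print(eigenvalues_via_u(k2p_matrix(alpha, beta)))
print([1.0, math.exp(-4 * beta), math.exp(-2 * alpha - 2 * beta)])
```

The two printed lists should agree up to rounding, with the last value of the second list appearing twice in the first.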

44 The functions Δ_λ(P) = −ln(λ_P) and Δ_μ(P) = −ln(μ_P) form a basis of AD_K2P. The standard total rate distance is Δ_K2P(P) = −(ln(λ_P) + 2ln(μ_P))/4 = −Δ_logdet(P)/4. The transversions only distance is Δ_tr(P) = −ln(λ_P)/4.
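
The additivity of the two basis functions can be verified on a product of two K2P matrices. This is a minimal numeric sketch; the helper names and the parameter values are hypothetical, and λ_P and μ_P are read off the matrix entries using the expressions from the previous slide.

```python
import math

ORDER = "AGCT"
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def k2p_matrix(alpha, beta):
    """K2P substitution matrix for rate-times-time products alpha, beta."""
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    p_beta = (1 - lam) / 4
    p_alpha = (1 + lam - 2 * mu) / 4
    stay = 1 - p_alpha - 2 * p_beta
    return [[stay if x == y else p_alpha if TRANSITION[x] == y else p_beta
             for y in ORDER] for x in ORDER]


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def delta_lambda(P):
    p_beta = P[0][2]                      # A -> C, a transversion
    return -math.log(1 - 4 * p_beta)


def delta_mu(P):
    p_alpha, p_beta = P[0][1], P[0][2]    # A -> G transition, A -> C transversion
    return -math.log(1 - 2 * p_beta - 2 * p_alpha)


def total_rate(P):
    return (delta_lambda(P) + 2 * delta_mu(P)) / 4


P = k2p_matrix(0.3, 0.1)
Q = k2p_matrix(0.2, 0.05)
PQ = matmul(P, Q)
print(delta_lambda(PQ), delta_lambda(P) + delta_lambda(Q))
print(delta_mu(PQ), delta_mu(P) + delta_mu(Q))
print(total_rate(P))   # should equal alpha + 2*beta for P, up to rounding
```

Any linear combination of `delta_lambda` and `delta_mu` inherits the same additivity, which is why the choice of coefficients is free to be tuned to the input.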

. The Adaptive distance based algorithm for the K2P model ACCGTTG AGCCGTG

46 Road Map Distance based reconstruction algorithms The Kimura 2 Parameter (K2P) Model Performance of distance methods in the K2P model Substitution models and substitution rate functions Properties of SR functions Unified Substitutions Models Optimizing Distances in the K2P model Simulation results

47 K2P distance estimation: where the noise comes from. u AACA…GTCTTCGAGGCCC v AGCA…GCCTATGCGACCT. Stages: inherent noise; implied noise propagation; user controlled noise propagation.

48 u v Selection of c 1, c 2 True distance Expected error Estimated distance +=

49 Expected Relative Error True distance Expected error = =

50 Minimizing the expected relative error

51 This means that equivalent SR functions have the same NMSE A basic property of Normalized Mean Square Error:

52 A Proper Disclosure on our optimal functions:

53 Relation between c and SR functions:
Function name | Function | c | c/(1+c)
Total rate (logdet) | −ln(λ_P) − 2ln(μ_P) | 1/2 | 1/3
Transversions only | −ln(λ_P) | | 1
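
The table entries can be reproduced with exact fractions. This sketch assumes the reading that an SR function is written as c_1·(−ln λ_P) + c_2·(−ln μ_P) with c = c_1/c_2, so that c/(1+c) = c_1/(c_1+c_2) is the relative weight of the transversion term; the function names are hypothetical.

```python
from fractions import Fraction


def relative_weight(c1, c2):
    """Relative weight c1/(c1+c2) of the -ln(lambda_P) term."""
    return Fraction(c1, c1 + c2)


def weight_from_c(c):
    """The same weight expressed through c = c1/c2."""
    return c / (1 + c)


print(relative_weight(1, 2))            # total rate: coefficients 1 and 2
print(weight_from_c(Fraction(1, 2)))    # same value via c = 1/2
print(relative_weight(1, 0))            # transversions only: c2 = 0
```

Working with c_1/(c_1+c_2) avoids the division by zero that c = c_1/c_2 would hit for the transversions only function.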

54 Optimal values of c_opt/(1+c_opt) for ti/tv ratio = 10 (α=20β). As the rate grows, the relative weight of the transversion coefficient increases.

55 Optimal values of c_1/(c_1+c_2) for various transition/transversion rates. (Plot panels: α=β; α=2β; α=4β; α=20β; α=200β; α>>β, rate>2.)

56 Expected relative error of various distance functions: theoretical prediction. (Plot legend: total rate; transversions; optimal.)

57 Road Map Distance based reconstruction algorithms The Kimura 2 Parameter (K2P) Model Performance of distance methods in the K2P model Substitution models and substitution rate functions Properties of SR functions Unified Substitutions Models Optimizing Distances in the K2P model Simulation results

58 Expected relative error of various distance functions: simulations. (Plot legend: total rate; transversions only; optimal; small eigenvalue distortion.)

59 Back to the K2P quartet resolution. A heuristic distance method (max-optimal) based on this talk: select a distance function which is optimal w.r.t. the largest of the six observed distances of the quartet (i.e., largest c_opt). Recall the performance of the two known distance functions on the template quartet.

60 When α β, the suggested heuristic performs better than both known methods.

61 Summary. Adaptive approach to distance based reconstruction: adjust the distance function to the input sequences. Distance functions for stochastic evolutionary models are defined by SR functions. SR functions can be constructed by Generalized Logdet. When the dimension of the space of SR functions is greater than 1, the adaptive approach is applicable. The adaptive approach is applicable to non-trivial unified models. Most common models are unified. An analysis of the simplest non-trivial unified model, K2P, shows a significant improvement in accuracy with the adaptive approach.

62 Further Research
- Prove/disprove: for any substitution model M, all the additive functions of M are GLD functions.
- In the K2P model:
  - Define and find optimal SR functions for: two distances, quartets, general trees.
  - Find optimal SR functions for non-homogeneous model trees.
  - Find optimal SR functions for variable rates across sites.
- Find optimal SR functions for more general evolutionary models (Tamura Nei) (analytic/heuristic methods).
- Empirical/analytical study of plugging adaptive distances into common reconstruction algorithms (e.g. NJ).
- Study improvement in performance on real biological data.
- Devise algorithms which use distance-vectors.


65 Further research questions We have infinitely many additive distance functions for the K2P model. Which one should we use for reconstructing the tree? If we have the exact substitution matrices for all pairs of taxa, then all functions are equally good. But we have only finite sequences, whose alignments provide only estimations of the true substitution matrices

66 Distances are defined by Substitution Rate functions. For each tree path u, v, w it holds that D(u,w) = D(u,v) + D(v,w).

67 Part 3.1: from Substitution models to Additive distances

68 From aligned sequences to joint distribution matrices. The aligned sequences provide, for each pair of DNA letters, say A and G, how many times A was mutated to G. This defines a joint distribution matrix F, with rows and columns indexed by A, G, T, C. Example entry: A is aligned with G in 5% of the pairs.
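
Building F from an alignment amounts to counting aligned letter pairs and normalizing. A minimal sketch, assuming gap-free, equal-length, uppercase sequences; the function name `joint_distribution` and the toy sequences are hypothetical.

```python
from collections import Counter

ALPHABET = "AGTC"


def joint_distribution(seq_u, seq_v):
    """F[x][y] = fraction of sites where x in u is aligned with y in v."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    counts = Counter(zip(seq_u, seq_v))
    n = len(seq_u)
    return {x: {y: counts[(x, y)] / n for y in ALPHABET} for x in ALPHABET}


F = joint_distribution("AAGC", "AGGC")
print(F["A"]["G"])   # 0.25: one of the four sites pairs A with G
```

All sixteen entries of F sum to 1, since every site contributes to exactly one cell.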

69 Joint distribution matrices are converted to distances by substitution models. These models describe how DNA sequences are transformed during evolution. The tool used for this is the theory of Markov processes. In the following we will sketch it. Additional reading is recommended…

70 Different biological models impose restrictions on the substitution matrices. Our model is the Kimura 2 Parameter (K2P) model. K2P distinguishes between two mutation types: transitions {A↔G, C↔T} and transversions ({A,G}↔{C,T}). u AACA…GTCTTCGAGGCCC v AGCA…GCCTATGCGACCT

71 K2P rate matrices have the following shape (rows and columns indexed by A, G, T, C): all transitions have rate α; all transversions have rate β.

72 Part 3.2: Distance functions for K2P ( Linear Algebra in the service of Biology)

73 μPμP 000 0μPμP 00 00λPλP U -1 P U = μQμQ 000 0μQμQ 00 00λQλQ U -1 Q U = U -1 PQ U = Let P,Q be two matrices in K2P. Then: μ P μ Q λ P λ Q U -1 PQ U =

74 U -1 PQ U = λ 1 (P) λ 2 (P) λ 3 (P)

λpλp U -1 P U = λpλp λpλp

76 ACGGTCA ACGGATA GGGGATT The joint distribution of each pair of vertices provides an approximation of the substitution matrices w v u The common theme of all projects: Start with input sequences for two or more taxa. Find a distance function which minimizes the inaccuracy (noise) introduced by the sampling process.

77 Instantaneous Rate Matrix AGTC A G T -32 C R uv = A is substituted by G in a rate of 1.5 times per million years (say) u v

78 A rate matrix R_uv together with an elapsed time t implies a stochastic substitution matrix P_uv = e^{tR}, with rows and columns indexed by A, G, T, C. Example entry: 20% of the As will be substituted by G.
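
The relation P = e^{tR} can be illustrated with a truncated Taylor series. This sketch uses a K2P rate matrix as the example (the slide's own matrix is generic) and a plain series expansion, which is adequate for small 4×4 matrices but is not a general-purpose matrix exponential; all names and parameter values are hypothetical.

```python
import math

ORDER = "AGCT"
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def k2p_rate_matrix(alpha, beta):
    """Rate alpha for transitions, beta for transversions, rows summing to 0."""
    return [[-(alpha + 2 * beta) if x == y
             else alpha if TRANSITION[x] == y else beta
             for y in ORDER] for x in ORDER]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(R, t, terms=40):
    """Truncated Taylor series for e^{tR}."""
    n = len(R)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (tR)^k / k!
        term = matmul(term, [[t * r / k for r in row] for row in R])
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


P = expm(k2p_rate_matrix(0.3, 0.1), t=1.0)
print(P[0][2], (1 - math.exp(-0.4)) / 4)   # A -> C entry vs. closed form
print(sum(P[0]))                           # each row should sum to 1
```

For the K2P example the series can be compared against the closed form, where each transversion entry equals (1 − e^{−4βt})/4.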

79 K2P rate matrix (rows and columns in order A, G, C, T):
    A  G  C  T
A   -  α  β  β
G   α  -  β  β
C   β  β  -  α
T   β  β  α  -

80 K2P rate matrix with parameters α′, β′ (rows and columns in order A, G, C, T):
    A   G   C   T
A   -   α′  β′  β′
G   α′  -   β′  β′
C   β′  β′  -   α′
T   β′  β′  α′  -

81 25% ACGGATA K2P Model tree: ====== + + r v R uv

82 AGTC A G T C

83 Jukes Cantor substitution matrix (rows and columns in order A, G, T, C):
    A     G     T     C
A   1-3p  p     p     p
G   p     1-3p  p     p
T   p     p     1-3p  p
C   p     p     p     1-3p


85 K2P Model tree: ====== AGCT

86 K2P rate matrices have the following shape (rows and columns indexed by A, G, T, C): all transitions have rate α; all transversions have rate β.

87 Given sequences at two adjacent vertices we define the edge length in two steps : vertices C1C1 C2C2 C3C3 C4C4 …CmCm u AACA…GTCTTCGAGGCCC v AGCA…GCCTATGCGACCT u v …TCTGGGA… …GGGGATT… First, align the sequences,

88 Natural evolutionary distance: Total substitution rate u v w Each edge is associated with a time t and a K2P rate matrix S. The total substitution rate along an edge of length t is t(α +2β). Total substitution rate between species = sum of the rates over the path connecting them. Total substitution rates are exact distances, which we try to reconstruct from observing the joint distribution of sequences at u and v.

89 How do we estimate D_K2P(u,v)? u AACA…GTCTTCGAGGCCC v AGCA…GCCTATGCGACCT. Our input is a pair of aligned sequences at u and v. They can be used to estimate the probability that a nucleotide X in u will be replaced by a nucleotide Y in v.

90 vertices C1C1 C2C2 C3C3 C4C4 …CmCm u AACA…GTCTTCGAGGCCC v AGCA…GCCTATGCGACCT Estimate P uv from the joint distributions: First step in distance estimation: (Maximum Likelihood)


92 The substitution matrix is estimated from the observed differences between the sequences (e.g. ACCGTTG vs. TCTGGGA). Errors in distance estimation are amplified when: the rate is small, so the signal is too weak (in extreme cases, there are no substitutions whatsoever); the rate is large, so recent substitutions overwrite older ones.

93 25% ACGGATA K2P Model tree: ====== + + r v R uv

94 How reliable Consider balanced quartets. Define the quartet ratio to be the ratio between the middle edge and two external edges.

95 The rate matrix S implies a stochastic substitution matrix P uv : u v P uv defines the joint distribution of the sequences at u,v.

96 What happens when α = β? (Jukes Cantor) transversion only is just a noisier version of the standard distance

97 performance of the standard distance method in reconstructing the split from estimated distances Distance based 4-point method (FPM): Reconstruction will fail if. diam AC BD AB C D AC DB w sep diam

98 root D C A B

99 Minimizing the expected relative error

. - Compute distances between all taxon-pairs - Find a tree (edge-weighted) best-describing the distances Distance based methods: The general scheme This Talk

101 AATCCTG ATAGCTG AATGGGC GAACGTA AAACCGA ACCGTTG TCTGGGA TCCGGAAAGCCGTG GGGGATT Phylogenetic Reconstruction

. D Adaptive distance based algorithm for the K2P model


D Find a good distance function - Compute distances between all taxon-pairs - Find a tree (edge-weighted) best-describing the distances Distance based methods: An adaptive scheme Find a distance function d which is good for the input This work

. Promotion: Make Distance based methods adaptive

106 Summary of previous slides:

107 Minimizing the expected relative error This means that equivalent SR functions have the same NMSE

108 The known SR functions for M_K2P are: Each SR function in M_K2P is a linear combination of these functions. When α>β, the optimal SR function lies between these two functions: for small distances we use the total rate, for large distances we use transversions only. This is depicted in the following plots.

109 Total rate (logdet) Transversion only (λ) values of c 1 / (c 1 +c 2 ) for total rate and transversions only