Lecture 10 – Models of DNA Sequence Evolution: correct for multiple substitutions in calculating pairwise genetic distances; derive transformation probabilities.


1 Lecture 10 – Models of DNA Sequence Evolution
Correct for multiple substitutions in calculating pairwise genetic distances. Derive transformation probabilities for likelihood-based methods.
$$\mathrm{Prob}(R_r \mid \tau) = \pi_m \times P_{m,k}(v_{3,1}) \times P_{k,A}(v_{1,w}) \times P_{k,G}(v_{1,x}) \times P_{m,l}(v_{3,2}) \times P_{l,C}(v_{2,y}) \times P_{l,C}(v_{2,z})$$
It’s the $P_{i,j}$’s that we need a substitution model to calculate. The models typically used are Markov processes.
A Poisson process is a stochastic process that can be used to model events in time. The time between events is exponentially distributed, with the rate of the process as its parameter.
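As a rough illustration of how the product above is used, the sketch below evaluates one reconstruction (states m, k, l at the internal nodes) and then sums over all 64 reconstructions to get a site likelihood. It assumes the Jukes-Cantor probabilities for the $P_{i,j}$, equal base frequencies for $\pi_m$, and branch lengths measured in expected substitutions per site ($v = 3\alpha t$); the function names and the branch-length values are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(i, j, v):
    """Jukes-Cantor probability of ending in state j given state i,
    for a branch of length v = 3*alpha*t (assumed parameterization)."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def reconstruction_prob(m, k, l, tips, v):
    """Probability of one reconstruction: m at the root, k and l at the
    two internal nodes, tips = observed states (w, x, y, z)."""
    w, x, y, z = tips
    return (0.25  # pi_m under equal base frequencies
            * jc_prob(m, k, v["3,1"]) * jc_prob(k, w, v["1,w"]) * jc_prob(k, x, v["1,x"])
            * jc_prob(m, l, v["3,2"]) * jc_prob(l, y, v["2,y"]) * jc_prob(l, z, v["2,z"]))

def site_likelihood(tips, v):
    """Sum the reconstruction probability over every choice of m, k, l."""
    return sum(reconstruction_prob(m, k, l, tips, v)
               for m, k, l in product(BASES, repeat=3))

# Hypothetical branch lengths, keyed by the subscripts used in the formula.
branch = {"3,1": 0.1, "1,w": 0.2, "1,x": 0.05, "3,2": 0.15, "2,y": 0.3, "2,z": 0.1}
print(site_likelihood("AGCC", branch))
```

A useful sanity check on any such implementation is that the site likelihoods of all 256 possible tip patterns sum to 1.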

2 Jukes-Cantor Model
The probability of a site remaining constant is: $p_{ii(t)} = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$
The probability of a site changing to a particular different nucleotide $j$ is: $p_{ij(t)} = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$
$\alpha$ is the rate at which any nucleotide changes to any other per unit time.
Given that the state at the site is $i$ at $t_0$, we start by estimating the probability of state $i$ at that site at $t_1$:
$p_{i(0)} = 1$
$p_{i(1)} = 1 - 3\alpha$

3 Jukes-Cantor Model
Now, what’s the probability of this site having state $i$ at $t_2$? There are two ways for the site to have state $i$ at $t_2$:
1 – It still hasn’t changed since time $t_0$.
2 – It has changed to something else and back again.
Therefore, $p_{i(2)} = (1 - 3\alpha)\, p_{i(1)} + \alpha\, [1 - p_{i(1)}]$, where
$(1 - 3\alpha)\, p_{i(1)}$ = the probability of no change at the site between $t_1$ and $t_2$, $(1 - 3\alpha)$, times the probability of the site having state $i$ at time $t_1$, $p_{i(1)}$, and
$\alpha\, [1 - p_{i(1)}]$ = the probability of a change to $i$, $\alpha$, times the probability that the site is not in state $i$ at time $t_1$, $1 - p_{i(1)}$.

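4 We have a recurrence equation.
$p_{i(t+1)} = (1 - 3\alpha)\, p_{i(t)} + \alpha\, [1 - p_{i(t)}] = p_{i(t)} - 3\alpha\, p_{i(t)} + \alpha - \alpha\, p_{i(t)}$
We can calculate the change in $p_{i(t)}$ across time, $\Delta t$:
$p_{i(t+1)} - p_{i(t)} = -3\alpha\, p_{i(t)} + \alpha - \alpha\, p_{i(t)}$
so $\Delta p_{i(t)} = -4\alpha\, p_{i(t)} + \alpha$
and, approximating the discrete steps by continuous time, $\frac{d p_{i(t)}}{dt} = -4\alpha\, p_{i(t)} + \alpha$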
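The recurrence can be checked numerically. The sketch below iterates it step by step and compares the result with the continuous-time form $\frac{1}{4} + (p_{i(0)} - \frac{1}{4}) e^{-4\alpha t}$; the two agree closely only when $\alpha$ is small per time step, which is the assumption behind the continuous approximation. The function names and the value of `alpha` are illustrative.

```python
import math

def iterate_recurrence(alpha, steps, p0=1.0):
    """Apply p(t+1) = (1 - 3a) p(t) + a (1 - p(t)) for the given number of steps."""
    p = p0
    for _ in range(steps):
        p = (1 - 3 * alpha) * p + alpha * (1 - p)
    return p

def continuous_solution(alpha, t, p0=1.0):
    """Continuous-time counterpart: 1/4 + (p0 - 1/4) exp(-4 a t)."""
    return 0.25 + (p0 - 0.25) * math.exp(-4 * alpha * t)

alpha = 1e-4  # hypothetical per-step substitution rate
for t in (10, 1000, 10000):
    print(t, iterate_recurrence(alpha, t), continuous_solution(alpha, t))
```

Both columns decay from 1 toward 1/4, the equilibrium at which the site's starting state has been forgotten.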

5 Jukes-Cantor Model
$p_{i(t)} = \frac{1}{4} + \left(p_{i(0)} - \frac{1}{4}\right) e^{-4\alpha t}$
We have the probability that a site has a particular nucleotide after time $t$, given in terms of its initial state.
If the site starts in state $i$, $p_{i(0)} = 1$. Therefore, $p_{ii(t)} = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$
If the site starts in a different state $j \neq i$, $p_{i(0)} = 0$, and $p_{ij(t)} = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$
$\alpha$ is an instantaneous rate, so we’ve modeled branch length (rate × time) explicitly in our expectations.
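These probabilities lead to the familiar Jukes-Cantor distance correction named in the lecture title. Two sequences that each evolved for time $t$ from a common ancestor are expected to differ at a proportion $p = \frac{3}{4}(1 - e^{-8\alpha t})$ of sites, while the expected number of substitutions per site between them is $d = 6\alpha t$; solving for $d$ gives $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$. The sketch below applies that formula; the function names and example sequences are hypothetical.

```python
import math

def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)

def jc_distance(p):
    """Jukes-Cantor corrected distance, d = -3/4 ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated, distance undefined")
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("ACGTACGTAC", "ACGTTCGTAA")  # 2 differences in 10 sites
print(p, jc_distance(p))
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as $p$ approaches the saturation value of 3/4.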

6 The JC model makes several assumptions.
1) All substitutions are equally likely; we have a single substitution type.
2) Base frequencies are assumed to be equal; each of the four nucleotides occurs at 25% of sites.
3) Each site has the same probability of experiencing a substitution as any other; we have an equal-rates model.
4) The process is constant through time.
5) Sites are independent of each other.
6) Substitution is a Markov process.
Q-matrix:
$$Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}$$

7 Substitution types and base frequencies.
For the general case:
$$Q = \begin{pmatrix} -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\ \mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\ \mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\ \mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G) \end{pmatrix}$$
where $\mu$ = the average instantaneous substitution rate, $a, b, c, \ldots, l$ are relative rate parameters (one of them is set to 1), and the $\pi_i$’s are the frequencies of the base that is being substituted to.
Note that the relative rates are not symmetric (for example, $a$ and $g$ can differ), and therefore the full model is non-reversible.
To make the model reversible, set $a = g$, $b = h$, $c = i$, $d = j$, $e = k$, and $f = l$.

8 Substitution types and base frequencies.
General Time-Reversible Model
$$Q = \begin{pmatrix} -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\ \mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\ \mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\ \mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G) \end{pmatrix}$$
There are six relative transformation rates (one of which is set to 1). There are four base frequencies that must sum to 1.
Note that this is not a symmetric matrix, but it can be decomposed into $R$ and $\Pi$.

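9 Substitution types and base frequencies.
$$R = \begin{pmatrix} -\mu(a+b+c) & \mu a & \mu b & \mu c \\ \mu a & -\mu(a+d+e) & \mu d & \mu e \\ \mu b & \mu d & -\mu(b+d+f) & \mu f \\ \mu c & \mu e & \mu f & -\mu(c+e+f) \end{pmatrix}$$
$$\Pi = \begin{pmatrix} \pi_A & 0 & 0 & 0 \\ 0 & \pi_C & 0 & 0 \\ 0 & 0 & \pi_G & 0 \\ 0 & 0 & 0 & \pi_T \end{pmatrix}$$
Visual GTR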
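One way to assemble the GTR Q-matrix from these two ingredients is sketched below: each off-diagonal entry is the symmetric relative rate times the frequency of the target base, and each diagonal entry is then set so that its row sums to zero. The function name, argument layout, and parameter values are assumptions made for illustration.

```python
def gtr_q(rates, freqs, mu=1.0):
    """Build a GTR rate matrix in the order A, C, G, T.

    rates: dict with the six relative rates a..f
    freqs: [pi_A, pi_C, pi_G, pi_T], summing to 1
    """
    a, b, c, d, e, f = (rates[name] for name in "abcdef")
    r = [[0, a, b, c],   # symmetric relative rates (off-diagonal part of R)
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    # Off-diagonals: rate times frequency of the base being substituted to.
    q = [[mu * r[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    # Diagonals: minus the sum of the other entries in the row.
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q

# Hypothetical parameter values.
rates = {"a": 1.0, "b": 4.0, "c": 0.8, "d": 1.2, "e": 5.0, "f": 1.0}
freqs = [0.3, 0.2, 0.2, 0.3]
for row in gtr_q(rates, freqs):
    print(["%.3f" % x for x in row])
```

Although Q itself is not symmetric, time reversibility shows up as detailed balance: $\pi_i Q_{ij} = \pi_j Q_{ji}$ for every pair of bases.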

10 Common Simplifications
Transition-type substitutions occur at a higher rate than transversion substitutions. The K2P model was the first to address this.
So we set $b = e = \kappa$ (for transitions), and $a = c = d = f = 1$ (for transversions). All $\pi_i = \frac{1}{4}$.
For K2P:
$$Q = \begin{pmatrix} -\mu(\kappa+2)/4 & \mu/4 & \mu\kappa/4 & \mu/4 \\ \mu/4 & -\mu(\kappa+2)/4 & \mu/4 & \mu\kappa/4 \\ \mu\kappa/4 & \mu/4 & -\mu(\kappa+2)/4 & \mu/4 \\ \mu/4 & \mu\kappa/4 & \mu/4 & -\mu(\kappa+2)/4 \end{pmatrix}$$
where $\alpha = \mu\kappa/4$ (the transition rate) and $\beta = \mu/4$ (the transversion rate). Thus, $\kappa = \alpha/\beta$.

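11 Hasegawa-Kishino-Yano (HKY) Model
For HKY:
$$Q = \begin{pmatrix} -\mu(\kappa\pi_G + \pi_Y) & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\ \mu\pi_A & -\mu(\kappa\pi_T + \pi_R) & \mu\pi_G & \mu\kappa\pi_T \\ \mu\kappa\pi_A & \mu\pi_C & -\mu(\kappa\pi_A + \pi_Y) & \mu\pi_T \\ \mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & -\mu(\kappa\pi_C + \pi_R) \end{pmatrix}$$
where $\kappa$ is the relative rate of transitions, $\pi_R = \pi_A + \pi_G$, and $\pi_Y = \pi_C + \pi_T$.
There are lots of other models that restrict the Q-matrix.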
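The HKY matrix can be built with one rule: a substitution gets relative rate $\kappa$ if it is a transition (A↔G or C↔T) and 1 otherwise, multiplied by the frequency of the target base. The sketch below does this for bases in the order A, C, G, T; the function names and the parameter values are illustrative assumptions.

```python
PURINES = {0, 2}  # indices of A and G in the order A, C, G, T

def is_transition(i, j):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (i in PURINES) == (j in PURINES)

def hky_q(freqs, kappa, mu=1.0):
    """Build the HKY rate matrix from base frequencies and kappa."""
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                rate = kappa if is_transition(i, j) else 1.0
                q[i][j] = mu * rate * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this sums the off-diagonals
    return q

# With equal base frequencies the HKY matrix reduces to the K2P matrix.
for row in hky_q([0.25, 0.25, 0.25, 0.25], kappa=2.0):
    print(row)
```

Setting $\kappa = 1$ as well as equal frequencies collapses the matrix further to Jukes-Cantor, which is one way to see how these models nest.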

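12 Some common models
There are 203 special cases of the GTR rate matrix (the number of ways to group the six relative rates into classes of equal rates), or 406 if each is also allowed to have either equal or unequal base frequencies.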

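13 Calculating Transformation Probabilities
The Q and R matrices we’ve been discussing define the instantaneous rates of substitution from one nucleotide to another. Convert the rates to probabilities by matrix exponentiation:
$P(t) = e^{Qt}$
Jukes-Cantor: $p_{ii(t)} = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$ and $p_{ij(t)} = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$
K2P, with transition rate $\alpha$ and transversion rate $\beta$: no change, $\frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha+\beta)t}$; transition, $\frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha+\beta)t}$; each transversion, $\frac{1}{4} - \frac{1}{4} e^{-4\beta t}$
Again, it’s these $P_{ij}$ that are used in the likelihood function.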
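For models without a simple closed form, $e^{Qt}$ has to be computed numerically. The sketch below is a minimal pure-Python matrix exponential (a truncated Taylor series combined with scaling and squaring), checked against the Jukes-Cantor closed form. It is one simple approach chosen for illustration, not necessarily what phylogenetics software does; the function names and the values of `alpha` and `t` are assumptions.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(Q t): Taylor series on Q t / 2^s, then square s times."""
    n = len(q)
    scale = t / (2 ** squarings)
    m = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds m^k / k!, starting at the identity
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

alpha, t = 0.1, 2.0  # hypothetical rate and time
jc_q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
p = mat_exp(jc_q, t)
print(p[0][0], 0.25 + 0.75 * math.exp(-4 * alpha * t))  # numerical vs closed form
print(p[0][1], 0.25 - 0.25 * math.exp(-4 * alpha * t))
```

In practice a library routine such as `scipy.linalg.expm`, or an eigendecomposition of Q, would normally be used instead of a hand-written series.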

