Models for DNA substitution
http://www.stat.rice.edu/~mathbio/Polanski/stat655/
Plan: Basics; Models in discrete time; Models in continuous time; Parameter estimation
Nucleotides
Purines: Adenine (A or a), Guanine (G or g)
Pyrimidines: Cytosine (C or c), Thymine (T or t)
Substitution
Transitions (purine → purine, pyrimidine → pyrimidine): A→G, G→A, C→T, T→C
Transversions (purine → pyrimidine, pyrimidine → purine): A→T, T→A, A→C, C→A, G→T, T→G, G→C, C→G
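The classification of substitutions can be sketched in Python as follows. The function name `classify_substitution` and the returned labels are assumptions made for this illustration only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(x, y):
    """Return 'none', 'transition', or 'transversion' for the change x -> y."""
    x, y = x.upper(), y.upper()
    valid = PURINES | PYRIMIDINES
    if x not in valid or y not in valid:
        raise ValueError("expected nucleotides A, G, C or T")
    if x == y:
        return "none"
    # A transition stays within the purines or within the pyrimidines.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"

examples = {pair: classify_substitution(*pair) for pair in ("AG", "CT", "AT", "GC")}
```

Here `examples` maps A→G and C→T to transitions and A→T and G→C to transversions.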
Other events: deletions, insertions, insertions in reverse order
Hypothesis: substitution of nucleotides in the evolution of DNA sequences can be modeled by a Markov chain (discrete time) or a Markov process (continuous time)
Other assumptions: stationarity, reversibility
Transition matrix P (rows: current nucleotide, columns: next nucleotide)

         a      g      c      t
    a   p_aa   p_ag   p_ac   p_at
    g   p_ga   p_gg   p_gc   p_gt
    c   p_ca   p_cg   p_cc   p_ct
    t   p_ta   p_tg   p_tc   p_tt
Models – discrete time
Jukes – Cantor model: all substitutions are equally probable
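A minimal sketch of a Jukes-Cantor one-step matrix in Python. The symbol `alpha` for the probability of each specific substitution and the function name are assumptions of this sketch, not notation recovered from the slides. The nucleotide order is a, g, c, t.

```python
NUCLEOTIDES = ("a", "g", "c", "t")

def jukes_cantor_matrix(alpha):
    """One-step matrix in which every specific substitution has probability alpha."""
    if not 0.0 <= alpha <= 1.0 / 3.0:
        raise ValueError("alpha must lie in [0, 1/3] so that 1 - 3*alpha is a probability")
    stay = 1.0 - 3.0 * alpha  # probability of no substitution
    return [[stay if i == j else alpha for j in range(4)] for i in range(4)]

P = jukes_cantor_matrix(0.01)
row_sums = [sum(row) for row in P]  # each row of a transition matrix sums to 1
```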
Stationary distribution
Spectral decomposition of P^n
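As an illustration, assume the Jukes-Cantor matrix has off-diagonal entries `alpha` and diagonal entries 1 - 3*alpha (an assumed parametrization). Then P = (1 - 4*alpha) I + alpha J, where J is the all-ones matrix. Its eigenvalues are 1 and 1 - 4*alpha (the latter with multiplicity three), which gives a closed form for P^n. The sketch below compares that closed form with repeated multiplication.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    """P**n by repeated multiplication (n >= 0)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

def jc_power_closed_form(alpha, n):
    """Entries of P**n obtained from the eigenvalues 1 and 1 - 4*alpha."""
    lam = (1.0 - 4.0 * alpha) ** n
    same, other = 0.25 + 0.75 * lam, 0.25 - 0.25 * lam
    return [[same if i == j else other for j in range(4)] for i in range(4)]

alpha, n = 0.02, 25
P = [[1.0 - 3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
numeric = mat_pow(P, n)
closed = jc_power_closed_form(alpha, n)
max_error = max(abs(numeric[i][j] - closed[i][j]) for i in range(4) for j in range(4))
```

As n grows, every entry tends to 1/4, in agreement with the uniform stationary distribution of this model.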
Remark: when learning and researching Markov models for nucleotide substitution, it greatly helps to use software for symbolic computation, such as Mathematica, Maple, or Scientific WorkPlace.
Kimura models, two parameters:
- probability of a transition
- probability of a specific transversion
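A minimal sketch of the two-parameter Kimura matrix. The names `alpha` (probability of a transition) and `beta` (probability of a specific transversion) are assumed here for illustration. The nucleotide order is a, g, c, t.

```python
NUCLEOTIDES = ("a", "g", "c", "t")
PURINES = {"a", "g"}

def kimura_matrix(alpha, beta):
    """One-step matrix with transition probability alpha and transversion probability beta."""
    if alpha < 0 or beta < 0 or alpha + 2 * beta > 1:
        raise ValueError("need alpha, beta >= 0 and alpha + 2*beta <= 1")

    def entry(x, y):
        if x == y:
            return 1.0 - alpha - 2.0 * beta  # one transition and two transversions leave x
        is_transition = (x in PURINES) == (y in PURINES)
        return alpha if is_transition else beta

    return [[entry(x, y) for y in NUCLEOTIDES] for x in NUCLEOTIDES]

P = kimura_matrix(0.03, 0.01)
```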
Kimura 3ST model, three parameters:
- probability of A↔G, C↔T
- probability of A↔C, G↔T
- probability of A↔T, C↔G
Stationary distribution
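For the Kimura 3ST model the transition matrix is symmetric, so its columns also sum to 1 and the uniform distribution is stationary. The sketch below checks this numerically. The names `alpha`, `beta`, `gamma` for the three substitution probabilities (A↔G and C↔T; A↔C and G↔T; A↔T and C↔G) are assumptions of this example.

```python
def kimura_3st_matrix(alpha, beta, gamma):
    """Kimura 3ST one-step matrix in the nucleotide order a, g, c, t."""
    d = 1.0 - alpha - beta - gamma  # probability of no substitution
    if min(alpha, beta, gamma, d) < 0:
        raise ValueError("parameters must be non-negative with sum at most 1")
    return [
        [d, alpha, beta, gamma],
        [alpha, d, gamma, beta],
        [beta, gamma, d, alpha],
        [gamma, beta, alpha, d],
    ]

def step(pi, P):
    """Distribution after one step of the chain: the row vector pi times P."""
    return [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]

P = kimura_3st_matrix(0.04, 0.02, 0.01)
uniform = [0.25] * 4
after_one_step = step(uniform, P)  # equals the uniform distribution again
```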
Generalizations of Kimura models
By Ewens, three parameters:
- probability of A↔G, C↔T
- probability of A→C, A→T, G→C, G→T
- probability of C→A, T→A, C→G, T→G
Stationary distribution
Spectral decomposition
By Blaisdell, four parameters:
- probability of A→G, C→T
- probability of G→A, T→C
- probability of A→C, A→T, G→C, G→T
- probability of C→A, T→A, C→G, T→G
Stationary distribution
Remark: this model is not reversible
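Non-reversibility of the Blaisdell model can be illustrated with Kolmogorov's cycle criterion: a reversible chain must give the same product of transition probabilities along any cycle and along its reverse. The sketch below uses assumed names for the four probabilities: `alpha` (A→G, C→T), `beta` (G→A, T→C), `gamma` (purine → pyrimidine) and `delta` (pyrimidine → purine). It evaluates the cycle a → g → c → a.

```python
def blaisdell_matrix(alpha, beta, gamma, delta):
    """Blaisdell one-step matrix in the nucleotide order a, g, c, t."""
    return [
        [1 - alpha - 2 * gamma, alpha, gamma, gamma],
        [beta, 1 - beta - 2 * gamma, gamma, gamma],
        [delta, delta, 1 - alpha - 2 * delta, alpha],
        [delta, delta, beta, 1 - beta - 2 * delta],
    ]

def cycle_products(P, cycle):
    """Products of transition probabilities along a cycle and along its reverse."""
    forward = backward = 1.0
    for i, j in zip(cycle, cycle[1:] + cycle[:1]):
        forward *= P[i][j]
        backward *= P[j][i]
    return forward, backward

P = blaisdell_matrix(0.04, 0.02, 0.01, 0.03)
forward, backward = cycle_products(P, [0, 1, 2])  # a -> g -> c -> a and its reverse
```

The forward product is alpha*gamma*delta and the backward product is beta*gamma*delta, so the two differ whenever `alpha` and `beta` differ and `gamma` and `delta` are both positive.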
Felsenstein model: the probability of substitution of any nucleotide by another is proportional to the stationary probability of the substituting nucleotide
Stationary distribution
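For the Felsenstein model, a short numerical check that the target frequencies are indeed stationary. This sketch assumes the off-diagonal entries are `u * pi[j]`, where `u` is a hypothetical name for the proportionality constant and the frequencies in `pi` are made-up values.

```python
def felsenstein_matrix(pi, u):
    """P[i][j] = u * pi[j] for i != j; the diagonal makes each row sum to 1."""
    n = len(pi)
    return [[u * pi[j] if i != j else 1.0 - u * (1.0 - pi[i]) for j in range(n)]
            for i in range(n)]

pi = [0.3, 0.2, 0.2, 0.3]  # hypothetical stationary probabilities for a, g, c, t
P = felsenstein_matrix(pi, 0.1)

# pi is stationary when the row vector pi times P returns pi.
pi_next = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
```

Detailed balance also holds in this parametrization, because pi[i] * P[i][j] = u * pi[i] * pi[j] is symmetric in i and j.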
HKY model (Hasegawa, Kishino, Yano): different rates for transitions and transversions
Eigenvalues of P
Left (row) eigenvectors
Right (column) eigenvectors
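A numerical check of eigenvalues and right eigenvectors for one common discrete-time parametrization of the HKY model. The parametrization is an assumption of this sketch: off-diagonal entries are `alpha * pi[j]` for transitions and `beta * pi[j]` for transversions, in the nucleotide order a, g, c, t. The notation on the slides may differ.

```python
def hky_matrix(pi, alpha, beta):
    """HKY-type one-step matrix under the assumed parametrization."""
    purine = [True, True, False, False]  # a, g are purines; c, t are pyrimidines
    P = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                P[i][j] = (alpha if purine[i] == purine[j] else beta) * pi[j]
        P[i][i] = 1.0 - sum(P[i])  # diagonal is still 0 here, so this completes the row
    return P

def apply(P, v):
    """Matrix times column vector."""
    return [sum(P[i][j] * v[j] for j in range(4)) for i in range(4)]

pi = [0.3, 0.2, 0.2, 0.3]  # hypothetical stationary probabilities
alpha, beta = 0.2, 0.05
P = hky_matrix(pi, alpha, beta)
pi_R, pi_Y = pi[0] + pi[1], pi[2] + pi[3]

# Candidate (eigenvalue, right eigenvector) pairs for this parametrization.
eigenpairs = [
    (1.0, [1.0, 1.0, 1.0, 1.0]),
    (1.0 - beta, [pi_Y, pi_Y, -pi_R, -pi_R]),
    (1.0 - (alpha * pi_R + beta * pi_Y), [pi[1], -pi[0], 0.0, 0.0]),
    (1.0 - (alpha * pi_Y + beta * pi_R), [0.0, 0.0, pi[3], -pi[2]]),
]
residuals = [max(abs(x - lam * y) for x, y in zip(apply(P, v), v))
             for lam, v in eigenpairs]
```

Each residual measures how far P v is from lambda v, and all four are zero up to rounding error.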
General 12-parameter model (Tavaré, 1986)
Stationary distribution
Reversibility conditions: A = D, B = G, C = J, E = H, F = K, I = L
Conclusion: the most general reversible model has 12 - 6 = 6 free parameters
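Reversibility can be checked through detailed balance, pi[i] * P[i][j] = pi[j] * P[j][i] for all i and j. The sketch below uses one standard construction, which is an assumption here and not necessarily the lettering used on the slides: off-diagonal entries `s[i][j] * pi[j]` with a symmetric table `s`, which has six distinct off-diagonal values. All numbers are hypothetical.

```python
def reversible_matrix(pi, s):
    """P[i][j] = s[i][j] * pi[j] for i != j, with s symmetric."""
    n = len(pi)
    P = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i])  # diagonal completes the row to 1
    return P

def satisfies_detailed_balance(pi, P, tol=1e-12):
    """True when pi[i] * P[i][j] equals pi[j] * P[j][i] for every pair."""
    n = len(pi)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))

pi = [0.3, 0.2, 0.2, 0.3]  # hypothetical frequencies for a, g, c, t
ag, ac, at, gc, gt, ct = 0.20, 0.05, 0.04, 0.06, 0.03, 0.25  # six symmetric factors
s = [
    [0.0, ag, ac, at],
    [ag, 0.0, gc, gt],
    [ac, gc, 0.0, ct],
    [at, gt, ct, 0.0],
]
P = reversible_matrix(pi, s)
reversible = satisfies_detailed_balance(pi, P)
```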
Continuous – time models
Matrix of transition probabilities P(t) = exp(Qt), where Q is the intensity matrix
Jukes – Cantor model
Spectral decomposition of P(t)
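A sketch for the continuous-time Jukes-Cantor model, assuming an intensity matrix Q with off-diagonal entries `alpha` and diagonal entries -3*alpha (assumed notation). The eigenvalues of Q are then 0 and -4*alpha, which gives closed-form entries for P(t) = exp(Qt). The code compares them with a truncated Taylor series of the matrix exponential.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(Q, t, terms=40):
    """Truncated Taylor series of exp(Q*t); adequate when the entries of Q*t are small."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, Qt)]  # (Qt)^k / k!
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

def jc_closed_form(alpha, t):
    """Entries of P(t) from the eigenvalues 0 and -4*alpha of Q."""
    decay = math.exp(-4.0 * alpha * t)
    same, other = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if i == j else other for j in range(4)] for i in range(4)]

alpha, t = 0.1, 2.0
Q = [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
numeric = expm_series(Q, t)
closed = jc_closed_form(alpha, t)
max_error = max(abs(numeric[i][j] - closed[i][j]) for i in range(4) for j in range(4))
```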
Kimura model
Spectral decomposition of P(t)
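For reference, a worked form of the entries of P(t) for the two-parameter Kimura model. The notation is assumed here: α is the rate of a transition and β the rate of each specific transversion. The eigenvalues of Q are then 0, -4β and -2(α+β), the last with multiplicity two.

```latex
% Assumed notation: \alpha = transition rate, \beta = rate of each specific transversion
\begin{aligned}
p_{\text{same}}(t)         &= \tfrac14 + \tfrac14 e^{-4\beta t} + \tfrac12 e^{-2(\alpha+\beta)t},\\
p_{\text{transition}}(t)   &= \tfrac14 + \tfrac14 e^{-4\beta t} - \tfrac12 e^{-2(\alpha+\beta)t},\\
p_{\text{transversion}}(t) &= \tfrac14 - \tfrac14 e^{-4\beta t}.
\end{aligned}
```

As a consistency check, the first two entries plus twice the third sum to 1 for every t, and at t = 0 they reduce to 1, 0 and 0.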
Parameter estimation
Jukes – Cantor model
Three configurations are equivalent due to reversibility:
- an ancestor A with two descendants D1 and D2
- the chain D1 → A → D2
- the direct path D1 → D2
Probability that the nucleotides are different in two descendants
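A worked form of this probability under the continuous-time Jukes-Cantor model. The assumptions are stated here because the original formula is not available: α is the rate of each specific substitution, and t is the time separating the ancestor from each descendant. By reversibility the two descendants are separated by a single chain run for time 2t.

```latex
% Assumed notation: p_{ii}(t) = \tfrac14 + \tfrac34 e^{-4\alpha t}
p \;=\; 1 - p_{ii}(2t) \;=\; \tfrac34\left(1 - e^{-8\alpha t}\right)
```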
Estimating p
We have two DNA sequences of length N:
D1: ACAATACAGGGCAGATAGATACAGATAGACACAGACAGAGCAGAGACAG
D2: ACAATACAGGACAGTTAGATACAGATAGACACAGACAGAGCAGAGACAG
p = (number of differences) / N
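The estimate can be computed directly for the two sequences D1 and D2. The sketch below also converts the estimate into the usual Jukes-Cantor distance, -3/4 ln(1 - 4p/3). That conversion is a standard formula added here for illustration and is not taken from the slides; the function names are hypothetical.

```python
import math

D1 = "ACAATACAGGGCAGATAGATACAGATAGACACAGACAGAGCAGAGACAG"
D2 = "ACAATACAGGACAGTTAGATACAGATAGACACAGACAGAGCAGAGACAG"

def proportion_different(s1, s2):
    """Return (number of differing sites, N, estimated p) for two aligned sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    differences = sum(1 for x, y in zip(s1, s2) if x != y)
    return differences, len(s1), differences / len(s1)

def jukes_cantor_distance(p_hat):
    """Expected substitutions per site under Jukes-Cantor, given the observed proportion."""
    if p_hat >= 0.75:
        raise ValueError("the distance is undefined for p_hat >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)

differences, N, p_hat = proportion_different(D1, D2)
distance = jukes_cantor_distance(p_hat)
```

For these sequences the two strings differ at 2 of N = 49 sites.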
Kimura model
p – probability that the two sequences show two different purines or two different pyrimidines at a site
q – probability that the two sequences show a purine and a pyrimidine at a site
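A sketch of estimating p and q by counting the two kinds of differences between aligned sequences, applied to D1 and D2. The distance formula -1/2 ln(1 - 2p - q) - 1/4 ln(1 - 2q) is the standard Kimura two-parameter distance, included here as an assumption about how p and q would be used; the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}

D1 = "ACAATACAGGGCAGATAGATACAGATAGACACAGACAGAGCAGAGACAG"
D2 = "ACAATACAGGACAGTTAGATACAGATAGACACAGACAGAGCAGAGACAG"

def count_differences(s1, s2):
    """Return (transition-type differences, transversion-type differences, N)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # two different purines or two different pyrimidines
        else:
            transversions += 1  # one purine and one pyrimidine
    return transitions, transversions, len(s1)

def kimura_distance(p_hat, q_hat):
    """Kimura two-parameter distance from the two observed proportions."""
    return (-0.5 * math.log(1.0 - 2.0 * p_hat - q_hat)
            - 0.25 * math.log(1.0 - 2.0 * q_hat))

n_transitions, n_transversions, N = count_differences(D1, D2)
p_hat, q_hat = n_transitions / N, n_transversions / N
distance = kimura_distance(p_hat, q_hat)
```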