Substitution Numbers and Scoring Matrices
Substitution Numbers
- The number of substitutions K between two sequences is an important quantity in molecular evolutionary analysis
- A simple count of observed differences may be misleading (several substitutions at the same site look like one, or like none), so statistical models are used to estimate the number of substitutions
  - Jukes-Cantor model
  - Kimura model
  (both are for nucleotides, but the ideas extend to amino acids)
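Jukes-Cantor Model
- Assumes that each nucleotide is equally likely to change into any other nucleotide, with probability α per time step
- A nucleotide stays unchanged in one time step with probability φ = 1 - 3α
- [Figure: state diagram over A, T, C, G; every arrow between two different bases is labeled α, every self-loop is labeled φ]
- What is the probability that, starting with C, we end up with C after 2 time steps? Sum over the four two-step paths:
  $P_{CC}(2) = P(C \to C \to C) + P(C \to A \to C) + P(C \to T \to C) + P(C \to G \to C) = \varphi^2 + 3\alpha^2$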
Jukes-Cantor Model
- The entry M(a,b) of the matrix M1 is the probability of substitution from nucleotide a to b in one time step (rows and columns in the order A, T, C, G):
  $$M_1 = \begin{pmatrix} \varphi & \alpha & \alpha & \alpha \\ \alpha & \varphi & \alpha & \alpha \\ \alpha & \alpha & \varphi & \alpha \\ \alpha & \alpha & \alpha & \varphi \end{pmatrix}, \qquad \varphi = 1 - 3\alpha$$
- What is the matrix M2, whose entries M(a,b) are the probabilities of substitution from a to b in two time steps?
  - Same calculation as for P_CC(2), but for all 16 pairs of bases: a -> X -> b, summed over the intermediate base X
  - $M_2(a,b) = \sum_{X \in \{A,T,C,G\}} M_1(a,X)\, M_1(X,b)$
  - C -> X -> C = α∙α + α∙α + φ∙φ + α∙α
  - C -> X -> A = α∙φ + α∙α + φ∙α + α∙α
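The two-step matrix can be checked numerically. The sketch below builds M1 for an arbitrarily chosen α (the value 0.01 is an assumption for illustration only), squares it with a plain matrix product, and compares two entries with the path sums written out above. The helper names `jc_matrix` and `mat_mul` are hypothetical.

```python
# Illustrative sketch: one-step Jukes-Cantor matrix M1 and its square M2.
ALPHA = 0.01          # assumed per-step substitution probability (hypothetical)
BASES = "ATCG"        # row/column order used on the slide


def jc_matrix(alpha):
    """One-step matrix: phi = 1 - 3*alpha on the diagonal, alpha elsewhere."""
    phi = 1 - 3 * alpha
    return [[phi if i == j else alpha for j in range(4)] for i in range(4)]


def mat_mul(x, y):
    """Plain 4x4 matrix product: sums a -> X -> b over the intermediate X."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


M1 = jc_matrix(ALPHA)
M2 = mat_mul(M1, M1)

c, a = BASES.index("C"), BASES.index("A")
phi = 1 - 3 * ALPHA
p_cc_2 = M2[c][c]     # should equal phi*phi + 3*alpha*alpha
p_ca_2 = M2[c][a]     # should equal 2*alpha*phi + 2*alpha*alpha
print(p_cc_2, p_ca_2)
```

Every row of M2 still sums to 1, as expected for a matrix of transition probabilities.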
Jukes-Cantor Model
- It turns out that $M_n = (M_1)^n$, the matrix whose entries M(a,b) are the probabilities of substitution from a to b in n time steps
- In general, under the J.C. model the probability that a site that starts as C contains a C after t time steps is
  $P_{CC}(t) = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}$
- This model can be used to derive an estimate of the number of substitutions per site that have occurred between two sequences:
  $K = -\tfrac{3}{4} \ln\left[ 1 - \tfrac{4}{3}\, p \right]$
  where p is the fraction of aligned nucleotide positions that are mismatches
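As a small worked example of the estimate K, the sketch below counts mismatches between two aligned sequences and applies the formula. The function name `jukes_cantor_distance` and the example sequences are made up for illustration; the sketch assumes an ungapped alignment of equal-length sequences.

```python
import math


def jukes_cantor_distance(seq1, seq2):
    """Estimate K = -3/4 * ln(1 - 4/3 * p) from two aligned, ungapped sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(1 for x, y in zip(seq1, seq2) if x != y)
    p = mismatches / len(seq1)          # fraction of mismatched positions
    if p >= 0.75:
        # the logarithm is undefined once p reaches 3/4 (saturation)
        raise ValueError("p >= 0.75: the estimate is undefined")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


# Hypothetical example: 2 mismatches in 10 positions, so p = 0.2
s1 = "ACGTACGTAC"
s2 = "ACGTTCGTAA"
k = jukes_cantor_distance(s1, s2)
print(k)
```

The estimate is larger than the raw mismatch fraction p, reflecting substitutions that a simple count cannot see.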
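Kimura Model
- Addresses the unrealistic assumption in the J.C. model that all substitutions are equally likely
- Two types of substitutions
  - transitions: purine<=>purine or pyrimidine<=>pyrimidine exchange (A<=>G, C<=>T), probability α per time step
  - transversions: purine<=>pyrimidine exchange, probability β per time step
- A nucleotide stays unchanged in one time step with probability φ = 1 - α - 2β
- [Figure: state diagram over A, T, C, G; the arrows A<=>G and T<=>C are labeled α, all other arrows between different bases are labeled β, every self-loop is labeled φ]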
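Kimura Model
- What is the probability that, starting with C, we end up with C after 2 time steps?
  $P_{CC}(2) = P(C \to C \to C) + P(C \to A \to C) + P(C \to T \to C) + P(C \to G \to C) = \varphi^2 + \beta^2 + \alpha^2 + \beta^2$
- In general, under the Kimura model the probability that a site that starts as C contains a C after t time steps is
  $P_{CC}(t) = \tfrac{1}{4} + \tfrac{1}{4}\, e^{-4\beta t} + \tfrac{1}{2}\, e^{-2(\alpha+\beta) t}$
- Estimated number of substitutions (TR: fraction of positions that differ by a transition, TV: fraction that differ by a transversion)
  $K = \tfrac{1}{2} \ln\left[ \frac{1}{1 - 2\,TR - TV} \right] + \tfrac{1}{4} \ln\left[ \frac{1}{1 - 2\,TV} \right]$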
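The Kimura estimate can be sketched the same way as the Jukes-Cantor one, except that mismatches are first split into transitions and transversions. The function name `kimura_distance` and the example sequences are hypothetical, and the sketch assumes an ungapped alignment over the letters A, C, G, T.

```python
import math

PURINES = {"A", "G"}   # C and T are the pyrimidines


def kimura_distance(seq1, seq2):
    """Estimate K from the transition (TR) and transversion (TV) fractions."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # same chemical class (both purines or both pyrimidines) -> transition
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    tr = transitions / len(seq1)
    tv = transversions / len(seq1)
    if 1 - 2 * tr - tv <= 0 or 1 - 2 * tv <= 0:
        raise ValueError("sequences too divergent: the estimate is undefined")
    return (0.5 * math.log(1 / (1 - 2 * tr - tv))
            + 0.25 * math.log(1 / (1 - 2 * tv)))


# Hypothetical example: one transition (A->G) and one transversion (A->C)
s1 = "AAAAAAAAAA"
s2 = "GAAAAAAAAC"
k = kimura_distance(s1, s2)
print(k)
```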
Scoring Matrices
Alignment Score
- The alignment score attempts to measure the likelihood of a common evolutionary ancestor
- Two possible ways to explain a given pairwise alignment
  - random model: the alignment could be produced purely by chance
  - evolutionary (non-random) model: there is high correlation between aligned pairs
- Under the random model
  - each position is independent of the others
  - the probability of amino acid a occurring at a position is $p_a$
- Under the non-random model
  - the probability of amino acid a depends on the matched residue b: the aligned pair (a, b) occurs with probability $q_{ab}$
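Substitution Matrices
- Given a (non-gapped) pairwise alignment of sequences
  A = a1 a2 a3 a4 … an
  B = b1 b2 b3 b4 … bn
- Under the non-random model, the probability of the alignment is
  $P_{\text{non-random}} = q_{a_1 b_1}\, q_{a_2 b_2}\, q_{a_3 b_3}\, q_{a_4 b_4} \cdots q_{a_n b_n}$
- Under the random model, the probability of the alignment is
  $P_{\text{random}} = p_{a_1} p_{a_2} \cdots p_{a_n} \cdot p_{b_1} p_{b_2} \cdots p_{b_n} = p_{a_1} p_{b_1}\, p_{a_2} p_{b_2}\, p_{a_3} p_{b_3}\, p_{a_4} p_{b_4} \cdots p_{a_n} p_{b_n}$
- Use the ratio of probabilities (odds ratio) to compare the models
  $r = \frac{P_{\text{non-random}}}{P_{\text{random}}}$
  r > 1: the non-random model is more likely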
Substitution Matrices
- Ratio of probabilities (odds ratio)
  $r = \frac{P_{\text{non-random}}}{P_{\text{random}}} = \frac{q_{a_1 b_1}\, q_{a_2 b_2}\, q_{a_3 b_3}\, q_{a_4 b_4} \cdots q_{a_n b_n}}{p_{a_1} p_{b_1}\, p_{a_2} p_{b_2}\, p_{a_3} p_{b_3}\, p_{a_4} p_{b_4} \cdots p_{a_n} p_{b_n}} = \frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}} \cdot \frac{q_{a_2 b_2}}{p_{a_2} p_{b_2}} \cdot \frac{q_{a_3 b_3}}{p_{a_3} p_{b_3}} \cdots \frac{q_{a_n b_n}}{p_{a_n} p_{b_n}}$
- Typically the log-odds ratio is used, which turns the product into a sum of per-position terms
  $\log(r) = \log\frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}} + \log\frac{q_{a_2 b_2}}{p_{a_2} p_{b_2}} + \log\frac{q_{a_3 b_3}}{p_{a_3} p_{b_3}} + \cdots + \log\frac{q_{a_n b_n}}{p_{a_n} p_{b_n}}$
- Each term is an entry of the substitution matrix: $\log\frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}}$ is the entry $(a_1, b_1)$
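The identity between the log of the odds ratio and the sum of per-position scores can be verified on a toy case. Everything in the sketch below is made up for illustration: a two-letter alphabet {X, Y}, the frequencies `p` and `q`, and the two aligned sequences. The natural logarithm is an assumption, since the slides do not fix the base.

```python
import math

# Hypothetical background and pair frequencies over a toy alphabet {X, Y}
p = {"X": 0.55, "Y": 0.45}
q = {("X", "X"): 0.50, ("X", "Y"): 0.05,
     ("Y", "X"): 0.05, ("Y", "Y"): 0.40}


def score(a, b):
    """Substitution-matrix entry (a, b): log of q_ab / (p_a * p_b)."""
    return math.log(q[(a, b)] / (p[a] * p[b]))


seq_a = "XXYYX"   # hypothetical ungapped alignment, row A
seq_b = "XXYYY"   # row B

# Direct computation of the two model probabilities
p_nonrandom = math.prod(q[(a, b)] for a, b in zip(seq_a, seq_b))
p_random = math.prod(p[a] * p[b] for a, b in zip(seq_a, seq_b))
log_r = math.log(p_nonrandom / p_random)

# Same quantity as a sum of per-position matrix entries
alignment_score = sum(score(a, b) for a, b in zip(seq_a, seq_b))
print(log_r, alignment_score)
```

Summing matrix entries is what makes the score cheap to compute position by position during alignment.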
Substitution Matrices
- Provide the "likelihood" that two amino acids (nucleotides) will occur as an aligned pair
- Common substitution matrices for protein alignment
  - PAM family: derived from alignments of high sequence identity
    (Dayhoff, Schwartz, and Orcutt. "A model of evolutionary change in proteins". In Atlas of Protein Sequence and Structure, Volume 5. 1978: 345-352.)
  - BLOSUM family: derived from alignments of low sequence identity
    (Henikoff and Henikoff. "Amino acid substitution matrices from protein blocks". Proc. Natl. Acad. Sci. 1992. 89(22): 10915-10919.)

BLOSUM62
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
BLOSUM Matrices
- Based on ungapped multiple local alignments of conserved regions of proteins with low sequence identity
- These alignments are used to derive $q_{ab}$, $p_a$ and $p_b$, which give the substitution score for amino acids a and b
  $\text{score}(a, b) = \log\frac{q_{ab}}{p_a\, p_b}$
- Procedure
  - obtain known ungapped multiple local alignments
  - split the sequences into clusters, so that every sequence in a cluster has ≥ C% identity with some other sequence in that cluster
  - for each pair of amino acids a and b, calculate
    q_ab = frequency of the a,b pair / total # pairs
    (pairs are counted between clusters; sequences within a cluster are given weight 1 / size_of_cluster)
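BLOSUM Matrices
- Calculating q_QN for BLOSUM62: within a cluster, each sequence has at least one other sequence with ≥ 62% identity
  ATCKQ
  ATCRN
  ASCKN
  SSCRN
  SDCEQ
  SECEN
  TECRQ
- No two of these sequences reach 62% identity (the closest pairs agree at 3 of 5 positions, i.e. 60%), so each sequence is its own cluster
- 7 clusters, 21 pairs of clusters
- 5 positions * 21 pairs = 105 total # of aligned pairs
- Q and N are aligned in 12 of those pairs (last column: 3 Q's and 4 N's, 3 * 4 = 12)
- q_QN = frequency of QN pair / total # aligned pairs = 12 / 105 = 0.114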
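The BLOSUM62 count above can be reproduced by brute force. The sketch assumes, as on the slide, that every sequence is its own cluster, so each pair of sequences contributes one aligned pair per column with weight 1. The function name `pair_frequency` is hypothetical.

```python
from itertools import combinations

# The seven aligned sequences from the slide; each is its own cluster at 62%
sequences = ["ATCKQ", "ATCRN", "ASCKN", "SSCRN", "SDCEQ", "SECEN", "TECRQ"]


def pair_frequency(seqs, a, b):
    """Count aligned (a, b) pairs, in either order, over all sequence pairs."""
    hits = 0
    total = 0
    for s, t in combinations(seqs, 2):      # 21 pairs of clusters
        for x, y in zip(s, t):              # 5 columns per pair
            total += 1
            if {x, y} == {a, b}:
                hits += 1
    return hits, total


hits, total = pair_frequency(sequences, "Q", "N")
q_qn = hits / total
print(hits, total, round(q_qn, 3))
```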
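BLOSUM Matrices
- Calculating q_QN for BLOSUM50: within a cluster, each sequence has at least one other sequence with ≥ 50% identity
  top: ATCKQ ATCRN ASCKN SSCRN
  mid: SDCEQ SECEN
  bot: TECRQ
- 3 clusters, 3 pairs of clusters
- 5 positions * 3 pairs of clusters = 15 total # of aligned pairs
- Sequences within a cluster are weighted by 1 / size_of_cluster, so in the last column top holds Q with weight ¼ and N with weight ¾, mid holds Q and N with weight ½ each, and bot holds Q with weight 1
- QN match frequency (between clusters):
  top, mid: ¼*½ + ¾*½
  top, bot: ¾*1
  mid, bot: ½*1
  total: 1/8 + 3/8 + 3/4 + 1/2 = 14/8
- q_QN = frequency of QN pair / total # aligned pairs = (14/8) / 15 = 0.1167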
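The weighted BLOSUM50 count can be sketched with exact fractions. The grouping into three clusters is taken as given from the slide rather than recomputed from the identity threshold, and the helper names `column_weights` and `weighted_pair_count` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Clusters as grouped on the slide (taken as given, not recomputed here)
clusters = [
    ["ATCKQ", "ATCRN", "ASCKN", "SSCRN"],   # top
    ["SDCEQ", "SECEN"],                     # mid
    ["TECRQ"],                              # bot
]


def column_weights(cluster, col):
    """Weight of each residue in one column; each sequence counts 1/size."""
    weights = {}
    for seq in cluster:
        weights[seq[col]] = weights.get(seq[col], 0) + Fraction(1, len(cluster))
    return weights


def weighted_pair_count(groups, a, b):
    """Weighted number of (a, b) pairs between clusters, and total pairs."""
    length = len(groups[0][0])
    cluster_pairs = list(combinations(groups, 2))
    count = Fraction(0)
    for c1, c2 in cluster_pairs:
        for col in range(length):
            w1 = column_weights(c1, col)
            w2 = column_weights(c2, col)
            count += w1.get(a, 0) * w2.get(b, 0) + w1.get(b, 0) * w2.get(a, 0)
    return count, length * len(cluster_pairs)


count, total = weighted_pair_count(clusters, "Q", "N")
q_qn = count / total
print(count, total, float(q_qn))
```

Using `Fraction` keeps the intermediate sums (1/8, 3/8, 3/4, 1/2) exact, so the total can be compared directly with the 14/8 on the slide.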
BLOSUM Matrices
- So far we have calculated $q_{ab}$ (i.e. the probability that a and b will be paired up under the non-random model)
- To compute the substitution score we need to know $p_a$ and $p_b$ (i.e. the probabilities that a and b occur by chance)
  $p_a = q_{aa} + \tfrac{1}{2} \sum_{b \neq a} q_{ab}$ ≈ fraction of all amino acids that are of type a
- The entry computed in the substitution matrix is:
  $\text{score}(a, b) = \log\frac{q_{ab}}{p_a\, p_b}$
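A minimal sketch of the last two steps, going from pair frequencies to background frequencies and then to a score. The pair frequencies `q` over the toy alphabet {Q, N} are invented for illustration, the function names are hypothetical, and the natural logarithm is an assumption. The sketch follows the simplified score formula as written above; Henikoff and Henikoff's published procedure additionally uses 2·p_a·p_b as the expected frequency when a ≠ b and reports scaled base-2 logarithms.

```python
import math

# Hypothetical pair frequencies for unordered pairs; they sum to 1.
# Keys are alphabetically sorted tuples.
q = {("Q", "Q"): 0.3, ("N", "Q"): 0.2, ("N", "N"): 0.5}


def background(pair_freqs):
    """p_a = q_aa + 1/2 * sum of q_ab over b != a."""
    p = {}
    for (a, b), freq in pair_freqs.items():
        if a == b:
            p[a] = p.get(a, 0.0) + freq
        else:
            p[a] = p.get(a, 0.0) + freq / 2
            p[b] = p.get(b, 0.0) + freq / 2
    return p


def score(pair_freqs, p, a, b):
    """score(a, b) = log(q_ab / (p_a * p_b)), following the formula above."""
    key = tuple(sorted((a, b)))
    return math.log(pair_freqs[key] / (p[a] * p[b]))


p = background(q)
s_qn = score(q, p, "Q", "N")
s_nn = score(q, p, "N", "N")
print(p, s_qn, s_nn)
```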
PAM Matrices
- Based on ungapped multiple local alignments of conserved regions of proteins with high sequence identity (> 85%)
- Uses phylogenetic trees to compute the entries in the substitution matrix
- Procedure
  - build a phylogenetic tree for sequences of high identity
  - compute the relative mutability, $m_a$, of each amino acid (how often a takes part in the substitutions in the phylogenetic tree)
  - compute $F_{ab}$ (number of substitutions of a with b)
  - compute $M_{ab}$ (mutation probability that a will be replaced by b)
    $M_{ab} = \frac{m_b\, F_{ab}}{\sum_c F_{cb}}$
  - compute the entry in the scoring matrix
    $\text{score}(a, b) = \log\frac{M_{ab}}{\text{frequency of } a}$
PAM Matrices
- Constructing a PAM matrix: phylogenetic tree, with the substitution on each branch
  ACGCTAFKI
    A->G: GCGCTAFKI
      A->G: GCGCTGFKI
      A->L: GCGCTLFKI
    I->L: ACGCTAFKL
      C->S: ASGCTAFKL
      G->A: ACACTAFKL
- Compute score(G, A): need $m_A$, $F_{GA}$, $\sum_c F_{cA}$
  - $m_A = 4 / (2 \cdot 6)$ (4 of the 6 substitutions in the tree involve A)
  - $F_{GA} = 3$
  - $\sum_c F_{cA} = 4$
  - $M_{GA} = \frac{m_A\, F_{GA}}{\sum_c F_{cA}}$
  - $\text{score}(G, A) = \log\frac{M_{GA}}{\text{frequency of } G} = \log\frac{M_{GA}}{10/63}$
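The arithmetic for score(G, A) can be carried through as a sketch. It reads the mutability on the slide as 4 / (2·6), counts substitutions between two amino acids in either direction, and takes the frequency of G over all seven sequences in the tree. These readings, the natural logarithm, and the helper names `count_between` and `count_involving` are assumptions.

```python
import math
from fractions import Fraction

# The seven sequences and six substitutions listed on the slide
sequences = ["ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
             "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL"]
substitutions = [("A", "G"), ("I", "L"), ("A", "G"),
                 ("A", "L"), ("C", "S"), ("G", "A")]


def count_between(a, b):
    """F_ab: substitutions between a and b, counted in either direction."""
    return sum(1 for pair in substitutions if set(pair) == {a, b})


def count_involving(a):
    """Number of substitutions in which a takes part."""
    return sum(1 for pair in substitutions if a in pair)


# Relative mutability of A, read from the slide as 4 / (2 * 6)
m_a = Fraction(count_involving("A"), 2 * len(substitutions))
f_ga = count_between("G", "A")
sum_f_ca = sum(count_between(c, "A") for c in "CGILS")   # all other letters seen

m_ga = m_a * f_ga / sum_f_ca

# Frequency of G among all residues in the seven sequences
freq_g = Fraction(sum(s.count("G") for s in sequences),
                  sum(len(s) for s in sequences))

score_ga = math.log(m_ga / freq_g)   # natural log: an assumption
print(m_a, f_ga, sum_f_ca, m_ga, freq_g, score_ga)
```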