Sequence Alignment III
Lecture #4
This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger and Shlomo Moran.
Background Readings: The second chapter (pages 12-45) in the text book, Biological Sequence Analysis, Durbin et al.; Chapter 3 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997.
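2 Log Odd-Ratio Test for Alignment
Taking the logarithm of Q yields
$$\log Q = \sum_i \log \frac{p(s[i],t[i])}{p_{s[i]}\, p_{t[i]}} = \sum_i \mathrm{Score}(s[i],t[i])$$
where $p(a,b)$ is the probability of seeing a aligned with b in related sequences and $p_a$ is the background frequency of letter a.
If $\log Q > 0$, then s and t are more likely to be related. If $\log Q < 0$, then they are more likely to be unrelated.
How can we relate this quantity to a score function?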
3 Estimating p(·,·) for proteins
Generate a large, diverse collection of accepted mutations. An accepted mutation is a substitution observed in an alignment of closely related protein sequences, for example the hemoglobin alpha chain in humans and other organisms (homologous proteins).
Let $p_a = n_a / n$, where $n_a$ is the number of occurrences of letter a and n is the total number of letters in the collection, so $n = \sum_a n_a$.
Mutation counts: let $f_{ab}$ be the number of mutations $a \leftrightarrow b$, let $f_a = \sum_{b \neq a} f_{ab}$ be the total number of mutations that involve a, and let $f = \sum_a f_a$ be the total number of amino acids involved in a mutation. Note that f is twice the number of mutations.
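The counting step can be sketched in a few lines. This is only an illustration: it assumes the accepted mutations come from ungapped aligned pairs, and the function name `mutation_counts` and the toy pairs are hypothetical.

```python
from collections import Counter

def mutation_counts(aligned_pairs):
    """Return (p, f_ab, f_a, f) from ungapped aligned pairs of equal length."""
    n_a = Counter()    # occurrences of each letter
    f_ab = Counter()   # mutation counts, stored for both (a, b) and (b, a)
    for s, t in aligned_pairs:
        n_a.update(s)
        n_a.update(t)
        for a, b in zip(s, t):
            if a != b:
                f_ab[(a, b)] += 1
                f_ab[(b, a)] += 1
    n = sum(n_a.values())
    p = {a: n_a[a] / n for a in n_a}       # p_a = n_a / n
    f_a = Counter()
    for (a, _b), count in f_ab.items():    # f_a = sum over b of f_ab
        f_a[a] += count
    f = sum(f_a.values())                  # twice the number of mutations
    return p, f_ab, f_a, f

# Toy data (assumption): two aligned pairs with one substitution each.
pairs = [("ACGT", "ACGA"), ("AACC", "AGCC")]
p, f_ab, f_a, f = mutation_counts(pairs)
```

With two substitutions in the toy data, f comes out as 4, matching the remark that f is twice the number of mutations.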
4 PAM-1 matrices
Define $M_{ab}$ to be the symmetric probability matrix for switching between a and b. We set $M_{aa} = 1 - m_a$, so that $m_a$ is the probability that a is involved in a change.
We define $M_{ab}$ such that only 1% of amino acids change according to this matrix, or 99% do not. Hence the name 1-Percent Accepted Mutation (PAM). In other words,
$$\sum_a p_a M_{aa} = 0.99$$
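5 PAM-1 matrices
We wish $m_a$ to be proportional to the relative mutability of letter a compared to other letters:
$$m_a = \frac{1}{K} \cdot \frac{f_a}{f\, p_a}$$
where K is a proportionality constant. We select K to satisfy the PAM-1 definition:
$$\sum_a p_a M_{aa} = \sum_a p_a (1 - m_a) = 1 - \frac{1}{K} \sum_a \frac{f_a}{f} = 1 - \frac{1}{K} = 0.99$$
So K = 100 for PAM-1 matrices. Note that K = 50 yields 2% change, etc.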
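A minimal sketch of building such a matrix from counts, on a made-up three-letter alphabet. It assumes the mutability $m_a = f_a / (K f p_a)$ and, for the off-diagonal entries, the usual Dayhoff-style choice $M_{ab} = m_a f_{ab} / f_a$; that off-diagonal rule, the name `pam_matrix`, and the toy numbers are assumptions for illustration.

```python
def pam_matrix(p, f_ab, K=100):
    """PAM-style matrix as a dict keyed by (a, b); rows are probability vectors."""
    letters = sorted(p)
    f_a = {a: sum(f_ab.get((a, b), 0) for b in letters if b != a) for a in letters}
    f = sum(f_a.values())
    M = {}
    for a in letters:
        m_a = f_a[a] / (K * f * p[a])          # probability that a changes
        for b in letters:
            if a == b:
                M[(a, b)] = 1.0 - m_a
            elif f_a[a]:
                # assumed rule: split m_a among targets in proportion to f_ab
                M[(a, b)] = m_a * f_ab.get((a, b), 0) / f_a[a]
            else:
                M[(a, b)] = 0.0
    return M

# Toy frequencies and symmetric mutation counts (assumptions).
p = {"A": 0.5, "B": 0.3, "C": 0.2}
f_ab = {("A", "B"): 6, ("B", "A"): 6, ("A", "C"): 2,
        ("C", "A"): 2, ("B", "C"): 4, ("C", "B"): 4}
M = pam_matrix(p, f_ab, K=100)
unchanged = sum(p[a] * M[(a, a)] for a in p)   # expected fraction left unchanged
```

Calling `pam_matrix(p, f_ab, K=50)` on the same counts gives the matrix in which 2% of the letters change.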
6 Evolutionary distance
The choice that 1% of amino acids change (and that K = 100) is quite arbitrary. It could fit a specific set of proteins whose evolutionary distance is such that indeed 1% of the letters have mutated. This is a unit of evolutionary change, not of time, because evolution acts differently on distinct sequence types.
What is the substitution matrix for k units of evolutionary time?
7 Model of Evolution
We make some assumptions:
1. Each position changes independently of the rest.
2. The probability of mutations is the same in each position.
3. Evolution does not “remember”.
[Figure: two positions evolving over times $t, t+\varepsilon, t+2\varepsilon, t+3\varepsilon, t+4\varepsilon$: A, A, C, C, G and T, T, T, C, G.]
8 Model of Evolution
• How do we model such a process?
• This process is called a Markov chain.
A chain is defined by the transition probability $P(X_{t+\varepsilon} = b \mid X_t = a)$, the probability that the next state is b given that the current state is a.
We often describe these probabilities by a matrix: $M[\varepsilon]_{ab} = P(X_{t+\varepsilon} = b \mid X_t = a)$
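9 Multi-Step Changes
Based on $M_{ab}$, we can compute the probabilities of changes over two time periods. Using conditional independence (no memory):
$$P(X_{t+2\varepsilon} = b \mid X_t = a) = \sum_c P(X_{t+\varepsilon} = c \mid X_t = a)\, P(X_{t+2\varepsilon} = b \mid X_{t+\varepsilon} = c) = \sum_c M[\varepsilon]_{ac}\, M[\varepsilon]_{cb}$$
Thus $M[2\varepsilon] = M[\varepsilon]\, M[\varepsilon]$.
By induction (homework exercise): $M[n\varepsilon] = M[\varepsilon]^n$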
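The multi-step rule is just a matrix power. A small sketch in plain Python follows; the two-state matrix is a made-up example and the helper names are hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][c] * B[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, n):
    """M to the n-th power (n >= 0) by repeated squaring."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in M]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

# Toy one-step transition matrix (assumption): rows sum to 1.
M = [[0.9, 0.1],
     [0.2, 0.8]]
M2 = mat_pow(M, 2)   # two-step transition probabilities
```

Here `M2[0][1]` sums over the intermediate state: 0.9 * 0.1 + 0.1 * 0.8 = 0.17.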
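10 A Markov Model (chain)
[Figure: chain $X_1 \to X_2 \to \cdots \to X_{n-1} \to X_n$.]
Every variable $X_i$ has a domain. For example, suppose the domain is the set of letters {a, c, t, g}. Every variable is associated with a local probability table $P(X_i = x_i \mid X_{i-1} = x_{i-1})$ and $P(X_1 = x_1)$. The joint distribution is given by
$$P(x_1, \ldots, x_n) = P(x_1) \prod_{i=2}^{n} P(x_i \mid x_{i-1})$$
In short, we write
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa_i)$$
where $Pa_i$ are the parents of variable/node $X_i$, namely none or $X_{i-1}$.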
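As a sketch, the joint probability of a chain is the prior of the first letter times one transition factor per step. The uniform prior and the transition table below are made-up numbers, and `chain_probability` is a hypothetical name.

```python
from itertools import product

def chain_probability(xs, prior, trans):
    """P(x_1..x_n) = P(x_1) * product of P(x_i | x_{i-1}); xs must be non-empty."""
    prob = prior[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        prob *= trans[prev][cur]
    return prob

letters = "acgt"
prior = {a: 0.25 for a in letters}                     # toy prior (assumption)
trans = {a: {b: 0.7 if a == b else 0.1 for b in letters}
         for a in letters}                             # toy transition table
p_aac = chain_probability("aac", prior, trans)         # 0.25 * 0.7 * 0.1
# Sanity check: probabilities of all length-3 chains add up to 1.
total = sum(chain_probability(xs, prior, trans) for xs in product(letters, repeat=3))
```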
11 Markov Model of Evolution Revisited
In the evolution model we studied earlier we had $P(x_1) = (p_a, p_c, p_g, p_t)$, which sum to 1 and are called the prior probabilities, and $P(x_i \mid x_{i-1}) = M[\varepsilon]$, which is a stationary transition probability table, not depending on the index i.
The quantity we computed earlier from this model was the joint probability table.
[Figure: chain $X_1 \to X_2 \to \cdots \to X_{n-1} \to X_n$, with edges labeled M.]
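12 Longer Term Changes
• Estimate $M[\varepsilon] = M$ (PAM-1 matrices).
• Use $M[n\varepsilon] = M^n$ (PAM-n matrices).
• Define $\mathrm{Score}(a,b) = \log \frac{p_a\, M[n\varepsilon]_{ab}}{p_a\, p_b} = \log \frac{M[n\varepsilon]_{ab}}{p_b}$
• Use this quantity to define the score for your application of interest.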
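A sketch of turning a PAM-1 style matrix into PAM-n scores. It assumes the score is the log-odds $\log(M^n_{ab} / p_b)$, uses a made-up two-letter matrix, and the name `pam_scores` is hypothetical.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][c] * B[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]

def pam_scores(M, p, n):
    """Log-odds scores log(M^n[a][b] / p[b]) for n >= 1.

    Assumes every entry of M^n is positive, otherwise the logarithm is undefined.
    """
    Mn = [row[:] for row in M]
    for _ in range(n - 1):
        Mn = mat_mul(Mn, M)
    size = len(p)
    return [[math.log(Mn[a][b] / p[b]) for b in range(size)] for a in range(size)]

# Toy two-letter example (assumption): 1% change per step, uniform frequencies.
p = [0.5, 0.5]
M = [[0.99, 0.01],
     [0.01, 0.99]]
S1 = pam_scores(M, p, 1)       # identities score > 0, substitutions < 0
S250 = pam_scores(M, p, 250)   # scores shrink toward 0 at large distance
```

In this toy case the scores approach 0 as n grows, because $M^n$ approaches the background frequencies and the two models become indistinguishable.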
13 Comments regarding PAM
• Historically, researchers have used PAM-250 (the only one published in the original paper).
• Original PAM matrices were based on a small number of proteins (circa 1978). Later versions use many more examples.
• Used to be the most popular scoring rule, but there are some problems with PAM matrices.
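14 Degrees of freedom in PAM definition
With K = 100 the 1-PAM matrix is given by $M[\varepsilon]_{aa} = 1 - \frac{f_a}{100\, f\, p_a}$, where $f_a / f$ is the share of mutation participations that involve a and $p_a$ is the frequency of a.
With K = 50 the basic matrix is different, namely $M[2\varepsilon]_{aa} = 1 - \frac{f_a}{50\, f\, p_a}$.
Thus we have two different ways to estimate the matrix $M[4\varepsilon]$:
• Use the 1-PAM matrix to the fourth power: $M[4\varepsilon] = M[\varepsilon]^4$
• Or use the K = 50 matrix to the second power: $M[4\varepsilon] = M[2\varepsilon]^2$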
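The two estimates can be compared numerically. The sketch below uses made-up counts on three letters and assumes the off-diagonal rule $M_{ab} = m_a f_{ab} / f_a$ with $m_a = f_a / (K f p_a)$; the helper names are hypothetical, and every letter is assumed to take part in at least one mutation.

```python
def pam(p, f_ab, K):
    """PAM-style matrix (list of rows) for scale constant K."""
    n = len(p)
    f_a = [sum(f_ab[a][b] for b in range(n) if b != a) for a in range(n)]
    f = sum(f_a)
    M = [[0.0] * n for _ in range(n)]
    for a in range(n):
        m_a = f_a[a] / (K * f * p[a])
        for b in range(n):
            M[a][b] = 1.0 - m_a if a == b else m_a * f_ab[a][b] / f_a[a]
    return M

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][c] * B[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, n):
    R = [[1.0 if i == j else 0.0 for j in range(len(M))] for i in range(len(M))]
    for _ in range(n):
        R = mat_mul(R, M)
    return R

p = [0.5, 0.3, 0.2]                       # toy letter frequencies (assumption)
f_ab = [[0, 6, 2],
        [6, 0, 4],
        [2, 4, 0]]                        # toy symmetric mutation counts
via_k100 = mat_pow(pam(p, f_ab, 100), 4)  # M[eps]^4
via_k50 = mat_pow(pam(p, f_ab, 50), 2)    # M[2 eps]^2
gap = max(abs(x - y) for r1, r2 in zip(via_k100, via_k50) for x, y in zip(r1, r2))
```

On this toy input the two matrices agree to first order in 1/K and differ only in the second-order terms, so `gap` is small but not zero.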
15 Problems in building distance matrices
• How do we find pairs of aligned sequences?
• How far is the ancestor?
  ◦ Earlier divergence: low sequence similarity.
  ◦ Later divergence: high sequence similarity.
  E.g., $M[250\varepsilon]$ is known not to reflect long-period changes well.
• Does one letter mutate to the other, or are they both mutations of a third letter?
16 BLOSUM Outline
• Idea: use aligned ungapped regions of protein families. These are assumed to have a common ancestor. Similar ideas but better statistics and modeling.
• Procedure:
  ◦ Cluster together sequences in a family whenever more than L% identical residues are shared.
  ◦ Count the number of substitutions across different clusters (in the same family).
  ◦ Estimate frequencies using the counts.
• Practice: Blosum50 and Blosum62 are widely used. (See page in the text book). Considered state of the art nowadays.
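The procedure above can be sketched on a single ungapped block. This is a simplified illustration, not the published BLOSUM construction: it assumes single-linkage clustering, gives each pair of clusters total weight 1 per column, and uses hypothetical names and a toy block.

```python
from collections import defaultdict
from itertools import combinations

def identity(s, t):
    """Fraction of identical positions in two equal-length sequences."""
    return sum(a == b for a, b in zip(s, t)) / len(s)

def cluster(block, threshold):
    """Join sequences that share more than `threshold` identity (single linkage)."""
    parent = list(range(len(block)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(block)), 2):
        if identity(block[i], block[j]) > threshold:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i, seq in enumerate(block):
        groups[find(i)].append(seq)
    return list(groups.values())

def pair_frequencies(block, threshold):
    """Relative frequencies of letter pairs counted across different clusters."""
    counts = defaultdict(float)
    for c1, c2 in combinations(cluster(block, threshold), 2):
        weight = 1.0 / (len(c1) * len(c2))   # members of a cluster share its weight
        for s in c1:
            for t in c2:
                for a, b in zip(s, t):
                    counts[tuple(sorted((a, b)))] += weight
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()} if total else {}

block = ["AAC", "AAC", "AGC"]                 # toy block (assumption)
freqs = pair_frequencies(block, threshold=0.8)
```

In the toy block the two identical sequences fall into one cluster, so they count once against the third sequence rather than twice.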
17 Multiple Sequence Alignment
S1 = AGGTC
S2 = GTTCG
S3 = TGAAC
Possible alignments:
AGGT-C-
-G-TTCG
TG-AAC-

AGGT-C-
GTT--CG
-TGAAAC
18 Multiple Sequence Alignment
Aligning more than two sequences.
Definition: Given strings $S_1, S_2, \ldots, S_k$, a multiple (global) alignment maps them to strings $S'_1, S'_2, \ldots, S'_k$ that may contain blanks, where:
1. $|S'_1| = |S'_2| = \cdots = |S'_k|$
2. The removal of blanks from $S'_i$ leaves $S_i$.
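The two conditions of the definition are easy to check mechanically. A small sketch, assuming blanks are written as "-" and using a hypothetical function name:

```python
def is_multiple_alignment(originals, aligned, blank="-"):
    """Check the two conditions: equal lengths, and blanks removed gives S_i."""
    if not aligned or len(originals) != len(aligned):
        return False
    same_length = len({len(row) for row in aligned}) == 1
    restores = all(row.replace(blank, "") == s
                   for s, row in zip(originals, aligned))
    return same_length and restores

strings = ["AGGTC", "GTTCG", "TGAAC"]
good = is_multiple_alignment(strings, ["AGGT-C-", "-G-TTCG", "TG-AAC-"])
bad = is_multiple_alignment(strings, ["AGGTC", "GTTCG-", "TGAAC"])  # unequal lengths
```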
19 Multiple alignments
We use a matrix to represent the alignment of k sequences, $K = (x_1, \ldots, x_k)$. We assume no column consists solely of blanks.
x1: MQ-ILLL
x2: MLR-LL-
x3: MK-ILLL
x4: MPPVLIL
The common scoring functions give a score to each column, and set $\mathrm{score}(K) = \sum_i \mathrm{score}(\mathrm{column}(i))$.
For k = 10, a scoring function has $2^k - 1 > 1000$ entries to specify. One need not have a general function. For example, the order of arguments need not matter: score(I,-,I,V) = score(-,I,I,V).
20 SUM OF PAIRS
MQ-ILLL
MLR-LL-
MK-ILLL
MPPVLIL
A common scoring function is SP, the sum of scores of the projected pairwise alignments: $\mathrm{SPscore}(K) = \sum_{i<j} \mathrm{score}(x_i, x_j)$.
In order for this score to be written as $\sum_i \mathrm{score}(\mathrm{column}(i))$, we must specify score(-,-) = 0. Why? Because these entries appear in the sum of columns but not in the sum of projected pairwise alignments (lines).
Note that we need to specify score(-,-) because a column may have several blanks (as long as not all entries are blanks).
21 SUM OF PAIRS
Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all projected pairwise alignments induced by A, where the pairwise alignment function $\mathrm{score}(x_i, x_j)$ is additive.
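22 Example
Consider the following alignment:
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance, with cost 0 for two identical symbols (including two blanks) and cost 1 for two different symbols, this alignment has an SP value of 3 + 4 + 5 = 12 (rows 1 and 2, rows 1 and 3, rows 2 and 3).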
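The value 12 can be reproduced with a few lines. The sketch assumes the unit-cost edit distance with cost 0 for two blanks; the function names are hypothetical.

```python
from itertools import combinations

def pair_cost(a, b):
    """Unit edit cost: 0 for identical symbols (including two blanks), else 1."""
    return 0 if a == b else 1

def sp_value(rows):
    """Sum over columns of the cost of every pair of symbols in the column."""
    return sum(pair_cost(a, b)
               for col in zip(*rows)
               for a, b in combinations(col, 2))

rows = ["ac-cdb-",
        "-c-adbd",
        "a-bcdad"]
total = sp_value(rows)
# The same total, seen as three projected pairwise alignments.
pairwise = [sp_value([r1, r2]) for r1, r2 in combinations(rows, 2)]
```

Summing by columns and summing by projected pairs give the same total precisely because a pair of blanks costs 0.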
23 Multiple Sequence Alignment
Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes $\mathrm{SPscore}(K) = \sum_{i<j} \mathrm{score}(x_i, x_j)$.
Instead of a 2-dimensional table, we now have a k-dimensional table to fill.
For each vector $i = (i_1, \ldots, i_k)$, compute an optimal multiple alignment for the k prefix sequences $x_1(1, \ldots, i_1), \ldots, x_k(1, \ldots, i_k)$.
The adjacent entries are those that differ in their index by one or zero. Each entry depends on $2^k - 1$ adjacent entries.
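24 The idea via k = 2
[Figure: the four adjacent cells V[i,j], V[i+1,j], V[i,j+1], V[i+1,j+1].]
Note that the new cell index (i+1, j+1) differs from previous indices by one of $2^k - 1$ non-zero binary vectors: (1,1), (1,0), (0,1).
Recall the notation $V[i,j] = d(s[1..i], t[1..j])$ and the following recurrence for V:
$$V[i+1,j+1] = \max \{\, V[i,j] + \mathrm{score}(s[i+1], t[j+1]),\ V[i,j+1] + \mathrm{score}(s[i+1], -),\ V[i+1,j] + \mathrm{score}(-, t[j+1]) \,\}$$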
25 The idea for arbitrary k
Order the vectors $i = (i_1, \ldots, i_k)$ by increasing order of the sum $\sum_j i_j$. Set s(0,..,0) = 0, and for i > (0,...,0):
$$s(i) = \max_{b} \{\, s(i - b) + \mathrm{SPscore}(\mathrm{column}(i,b)) \,\}$$
where
• The vector b ranges over all non-zero binary vectors.
• The vector i − b is the non-negative difference of i and b, so only vectors b with $i - b \ge 0$ in every coordinate are considered.
• The j-th entry of column(i,b) equals $c_j = x_j(i_j)$ if $b_j = 1$, and $c_j$ = ‘-’ otherwise (reflecting that b is 1 at location j if that location changed in the “current comparison”).
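A direct sketch of this recurrence for small inputs. The pairwise score (+1 match, -1 mismatch or blank against a letter, 0 for two blanks) is a made-up choice, the names are hypothetical, and the cells are visited in lexicographic order rather than by increasing sum; that order is also valid because i − b always precedes i lexicographically.

```python
from itertools import combinations, product

def pair_score(a, b):
    """Hypothetical pairwise score; two blanks score 0."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1

def column_score(col):
    """SP score of one column: sum over all pairs of its entries."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))

def optimal_sp_score(seqs):
    """Fill the k-dimensional table s(i) and return the optimal SP score."""
    k = len(seqs)
    moves = [b for b in product((0, 1), repeat=k) if any(b)]  # 2^k - 1 vectors
    s = {(0,) * k: 0}
    for i in product(*(range(len(x) + 1) for x in seqs)):
        if not any(i):
            continue
        best = None
        for b in moves:
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            if min(prev) < 0:            # i - b must stay non-negative
                continue
            col = [x[ij - 1] if bj else "-" for x, ij, bj in zip(seqs, i, b)]
            cand = s[prev] + column_score(col)
            if best is None or cand > best:
                best = cand
        s[i] = best
    return s[tuple(len(x) for x in seqs)]

score3 = optimal_sp_score(["ACG", "ACG", "ACG"])   # three identical strings
```

The table has one entry per index vector, so this is only practical for a handful of short sequences.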
26 Complexity of the DP approach
Number of cells: $n^k$.
Number of adjacent cells: $O(2^k)$.
Computation of the SP score for each column(i,b) is $O(k^2)$.
Total run time is $O(k^2 2^k n^k)$, which is utterly unacceptable!
Not much hope for a polynomial algorithm because the problem has been shown to be NP-complete.
Need heuristics to reduce time.
27 Time saving heuristics: Relevance tests
Heuristic: Avoid computing score(i) for irrelevant vectors.
x1: MQ-ILLL
x2: MLR-LL-
x3: MK-ILLL
x4: MPPVLIL
Let L be a lower bound on the optimal SP score of a multiple alignment of the k sequences. A lower bound L can be obtained from an arbitrary multiple alignment, computed in any way.
Main idea: Using L, compute lower bounds $L_{uv}$ for the optimal score for every two sequences $s = x_u$ and $t = x_v$, $1 \le u < v \le k$. When processing vector $i = (\ldots, i_u, \ldots, i_v, \ldots)$, the relevant cells are such that in every projection on $x_u$ and $x_v$, the optimal pairwise score through $(i_u, i_v)$ is above $L_{uv}$.
28 Recall the Linear Space algorithm
• $V[i,j] = d(s[1..i], t[1..j])$
• $B[i,j] = d(s[i+1..n], t[j+1..m])$
• $V[i,j] + B[i,j]$ = score of best alignment through (i,j)
These computations are done in linear space.
Build such a table for every two sequences $s = x_u$ and $t = x_v$, $1 \le u, v \le k$. The entry $(i_u, i_v)$ of this table encodes the optimum through $(i_u, i_v)$.
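A sketch of the forward and backward tables and their sum. For readability it keeps the full tables in memory instead of working in linear space, and it assumes a made-up score of +1 for a match and -1 for a mismatch or a blank; the function names are hypothetical.

```python
def score(a, b):
    """Hypothetical score: +1 match, -1 mismatch; a blank "-" never matches a letter."""
    return 1 if a == b else -1

def forward_table(s, t):
    """V[i][j] = best score of aligning s[1..i] with t[1..j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                          V[i - 1][j] + score(s[i - 1], "-"),
                          V[i][j - 1] + score("-", t[j - 1]))
    return V

def backward_table(s, t):
    """B[i][j] = best score of aligning s[i+1..n] with t[j+1..m].

    Computed by running the forward pass on the reversed strings.
    """
    n, m = len(s), len(t)
    R = forward_table(s[::-1], t[::-1])
    return [[R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]

def through_table(s, t):
    """Entry (i, j) is the best score of an alignment passing through (i, j)."""
    V, B = forward_table(s, t), backward_table(s, t)
    return [[V[i][j] + B[i][j] for j in range(len(t) + 1)]
            for i in range(len(s) + 1)]

T = through_table("ACGT", "AGT")
```

A relevance test for the pair (u, v) would then compare the entry at $(i_u, i_v)$ of such a table against the bound $L_{uv}$.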
29 Time saving heuristics: Relevance test
But can we go over all cells and determine whether they are relevant or not? No.
Start with (0,…,0) and add relevant entries to the list until reaching $(n_1, \ldots, n_k)$.
30 Star Alignments
Rather than summing up all pairwise alignments, select a fixed sequence $x_0$ as a center, and set $\text{Star-score}(K) = \sum_{j>0} \mathrm{score}(x_0, x_j)$.
The algorithm to find an optimal alignment: at each step, add another sequence aligned with $x_0$, keeping old gaps and possibly adding new ones.
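A sketch of this incremental construction. The scoring values (+1, -1, -1), the helper names, and the toy sequences are assumptions; existing gap columns are kept and a new gap in the center opens a blank column in every row added so far.

```python
def align_pair(s, t, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment with traceback (hypothetical scoring values)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap
    for j in range(1, m + 1):
        V[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub, V[i - 1][j] + gap, V[i][j - 1] + gap)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        diag = i > 0 and j > 0
        sub = match if diag and s[i - 1] == t[j - 1] else mismatch
        if diag and V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

def merge(msa, a0, ax):
    """Add ax (aligned to center copy a0) to msa, whose row 0 is the center."""
    center, rows, new_row, p, q = msa[0], [[] for _ in msa], [], 0, 0
    while p < len(center) or q < len(a0):
        if p < len(center) and center[p] == "-":      # old gap column: keep it
            for r, row in zip(rows, msa):
                r.append(row[p])
            new_row.append("-")
            p += 1
        elif q < len(a0) and a0[q] == "-":            # new gap in the center
            for r in rows:
                r.append("-")
            new_row.append(ax[q])
            q += 1
        else:                                         # same center letter in both
            for r, row in zip(rows, msa):
                r.append(row[p])
            new_row.append(ax[q])
            p, q = p + 1, q + 1
    return ["".join(r) for r in rows] + ["".join(new_row)]

def star_alignment(center, others):
    msa = [center]
    for x in others:
        a0, ax = align_pair(center, x)
        msa = merge(msa, a0, ax)
    return msa

msa = star_alignment("ACGT", ["AGT", "ACGGT"])
```

By construction, each row projected against the center row reproduces the pairwise alignment that was computed for it.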
31 Tree Alignments
Assume that there is a tree T = (V,E) whose leaves are the sequences.
Associate a sequence with each internal node.
$\text{Tree-score}(K) = \sum_{(i,j) \in E} \mathrm{score}(x_i, x_j)$.
Finding the optimal assignment of sequences to the internal nodes is NP-hard.
We will meet this problem again in the next topic: Phylogenetic trees.
32 Multiple Sequence Alignment – Approximation Algorithm
In tutorial time you will see an $O(k^2 n^2)$ multiple alignment algorithm that errs by a factor of at most $2(1 - 1/k) < 2$.