Sequence Alignment II: Lecture #3


1 Sequence Alignment II, Lecture #3
This class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger. www.cs.huji.ac.il
Background Readings: The second chapter (pages 12-45) in the textbook Biological Sequence Analysis, Durbin et al., 2001. Chapter 3.5.1 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997.

2 Reminder
- Last class we discussed dynamic programming algorithms for
  - global alignment
  - local alignment
- All of these assumed a scoring rule that determines the quality of perfect matches, substitutions, insertions, and deletions.

3 Alignment in Real Life
- One of the major uses of alignments is to find sequences in a “database.”
- The current protein database contains about 10^8 residues! Searching for a target sequence of length 10^3 therefore requires evaluating about 10^11 matrix cells, which takes about three hours at a rate of 10 million evaluations per second.
- This is quite annoying when, say, one thousand target sequences need to be searched, because the run will take about four months.
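
The estimates on this slide can be checked with a short calculation. The only assumption added here is that each query fills a full DP matrix against the whole database:

```latex
\[
10^{8} \times 10^{3} = 10^{11}\ \text{cells}, \qquad
\frac{10^{11}\ \text{cells}}{10^{7}\ \text{cells/s}} = 10^{4}\ \text{s} \approx 2.8\ \text{hours},
\]
\[
10^{3}\ \text{queries} \times 10^{4}\ \text{s} = 10^{7}\ \text{s} \approx 116\ \text{days} \approx 4\ \text{months}.
\]
```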

4 Heuristic Search
- Instead, most searches rely on heuristic procedures.
- These are not guaranteed to find the best match.
- Sometimes, they will completely miss a high-scoring match.
We now describe the main ideas used by the best known of these heuristic procedures.

5 Basic Intuition
- Almost all heuristic search procedures are based on the observation that real-life matches often contain long strings with gap-less matches.
- These heuristics try to find significant gap-less matches and then extend them.

6 Banded DP
- Suppose that we have two strings s[1..n] and t[1..m] such that n ≤ m.
- If the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal.
[Figure: DP matrix of s against t, with the alignment path running close to the diagonal.]

7 Banded DP
- To find such a path, it suffices to search in a diagonal band of the matrix.
- If the diagonal band consists of k diagonals (width k), then dynamic programming takes O(kn) time.
- This is much faster than the O(n^2) time of standard DP.
- Note that along a diagonal, i-j is constant.
[Figure: band of width k in the DP matrix of s against t. At the edge of the band, cell V[i+1, i+k/2+1] is computed from V[i, i+k/2] and V[i+1, i+k/2]; its third neighbor is out of range.]
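
The band idea can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name, the scoring values (match +1, mismatch -1, gap -2), and the choice to keep cells with |i-j| ≤ k/2 are all assumptions made for the example. Each row stores only the cells inside the band, so the work is O(kn).

```python
def banded_global_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= k // 2.

    Scoring values are illustrative assumptions, not taken from the lecture.
    """
    n, m = len(s), len(t)
    half = k // 2
    if abs(n - m) > half:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # Row 0: only columns inside the band are stored.
    prev = {j: j * gap for j in range(0, min(m, half) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - half), min(m, i + half) + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Neighbors outside the band count as -infinity ("out of range").
            cur[j] = max(
                prev.get(j - 1, neg_inf) + sub,  # diagonal move
                prev.get(j, neg_inf) + gap,      # gap in t
                cur.get(j - 1, neg_inf) + gap,   # gap in s
            )
        prev = cur
    return prev[m]


print(banded_global_score("ACGT", "ACGT", k=2))
print(banded_global_score("ACGT", "AGT", k=2))
```

If the optimal path leaves the band, the returned score is only a lower bound on the true optimum, which is the price paid for the speedup.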

8 Banded DP for local alignment
Problem: Where is the banded diagonal? It need not be the main diagonal when looking for a good local alignment. How do we select which subsequences to align using banded DP?
We heuristically find potential diagonals and evaluate them using banded DP. This is the main idea of FASTA.

9 Overview of FASTA
Input: strings s and t, and a parameter ktup.
Output: a high-scoring local alignment.
1. Find pairs of matching substrings s[i...i+ktup-1] = t[j...j+ktup-1].
2. Extend to ungapped diagonals.
3. Extend to gapped matches using banded DP.

10 Finding Potential Diagonals
Suppose there exists a relatively long gap-less match:
S=****AGCGCCATGGATTGAGCGA*
T=**TGCGACATTGATCGACCTA**
- Each such match defines a potential diagonal as follows. If the match starts at location i in the first sequence (e.g., 5 above) and at location j in the second (e.g., 3 above), then the potential diagonal starts at location (i,j).
- Can we identify potential diagonals quickly?
- Such diagonals can then be evaluated using banded DP.

11 Identifying Potential Diagonals
Assumption: High-scoring gap-less alignments contain several “seeds” of perfect matches.
S=****AGCGCCATGGATTGAGCGA*
T=**TGCGACATTGATCGACCTA**
Since this is a gap-less alignment, all perfect-match regions reside on the same diagonal (defined by i-j).
How do we find seeds efficiently?

12 Formalizing the task
Let ktup be a parameter denoting the seed length of interest.
Task at hand (identifying seeds): Find all pairs (i,j) such that s[i...i+ktup-1] = t[j...j+ktup-1].

13 Finding Seeds Efficiently
- Prepare an index table of the database sequence S such that, for any sequence of length ktup, one gets the list of its positions in S.
- March along the query sequence T while using the index table to list all matches with the database sequence S.
These steps take linear time: O(|s|+|t|).
Index table (ktup = 2):
  AA  -
  AC  -
  AG  5, 19
  AT  11, 15
  CA  10
  CC  9
  CG  7, 21
  …
  TT  16
S=****AGCGCCATGGATTGAGCGA* (the letters occupy positions 5 to 23)
T=**TGCGACATTGATCGACCTA**
Query position 7: (-,7), no match
Query position 8: (10,8), one match
Query position 9: (11,9), (15,9), two matches
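
The two steps on this slide can be sketched as follows. This is an illustrative sketch rather than FASTA's actual code: the function names are made up for the example, positions are 1-based to match the slide, and words containing '*' are skipped because '*' stands for unspecified letters in the slide's strings.

```python
from collections import defaultdict


def build_index(s, ktup):
    """Map each ktup-letter word to the 1-based positions where it starts in s."""
    index = defaultdict(list)
    for i in range(len(s) - ktup + 1):
        word = s[i:i + ktup]
        if "*" not in word:  # '*' marks unspecified letters in the example
            index[word].append(i + 1)
    return index


def find_seeds(index, t, ktup):
    """List all pairs (i, j) such that the word at t[j] also starts at s[i]."""
    seeds = []
    for j in range(len(t) - ktup + 1):
        for i in index.get(t[j:j + ktup], ()):
            seeds.append((i, j + 1))
    return seeds


S = "****AGCGCCATGGATTGAGCGA*"
T = "**TGCGACATTGATCGACCTA**"
index = build_index(S, 2)
seeds = find_seeds(index, T, 2)
print(index["AG"], index["AT"], index["CG"])
print([pair for pair in seeds if pair[1] in (7, 8, 9)])
```

A dictionary plays the role of the hashed index table: only words that actually occur in S are stored.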

14 Comments
- The maximal size of the index table is |Σ|^ktup, where |Σ| is the alphabet size (4 or 20).
- For small ktup, the entire table is stored.
- For large ktup values, one should keep only entries for tuples actually found in the database, so that the index table size is indeed linear. In this case, hashing is needed.
- Typical values of ktup are 1-2 for proteins and 4-6 for DNA. The tradeoffs of these values are discussed later.
- The index table is prepared for each database sequence ahead of users’ matching requests, at compilation time. So matching time is O(|t|).

15 Identifying Potential Diagonals
- Input: sets of pairs, e.g., (6,4), (10,8), (14,12), (15,10), (20,4), …
- Task: locate sets of pairs that are on the same diagonal.
- Method: sort according to the difference i-j.
S=***AGCGCCATGGATTGAGCGA*
T=**TGCGACATTGATCGACCTA**
In the example, (6,4), (10,8), and (14,12) share the diagonal i-j = 2, while (20,4) lies on the diagonal i-j = 20-4 = 16.
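
Sorting by i-j and collecting the pairs per diagonal might look like the sketch below. The function name is hypothetical; the input pairs are the ones listed on the slide.

```python
from collections import defaultdict


def group_by_diagonal(pairs):
    """Group seed pairs (i, j) by their diagonal offset i - j."""
    diagonals = defaultdict(list)
    # Sort by offset first, then by position along the diagonal.
    for i, j in sorted(pairs, key=lambda p: (p[0] - p[1], p[0])):
        diagonals[i - j].append((i, j))
    return dict(diagonals)


pairs = [(6, 4), (10, 8), (14, 12), (15, 10), (20, 4)]
diagonals = group_by_diagonal(pairs)
for offset, members in diagonals.items():
    print(offset, members)
```

Diagonals that collect many pairs are the candidates passed on to the extension step.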

16 Processing Potential Diagonals
For offsets i-j that occur with high frequency, namely diagonals with many pieces, combine the pieces into regions by extending them greedily along the diagonal as long as the score improves (and never falls below some score value).
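
One way to read "extend greedily along the diagonal" is sketched below for the rightward direction only. The scoring values, the `floor` cutoff, and the function name are assumptions for illustration; FASTA's actual rules are not spelled out on the slide. Indices here are 0-based.

```python
def extend_right(s, t, i, j, length, match=1, mismatch=-1, floor=0):
    """Extend the seed s[i:i+length] == t[j:j+length] rightward along its diagonal.

    Returns (best_score, best_length): the highest running score seen and the
    region length at which it was first reached. The extension stops as soon
    as the running score falls below `floor`.
    """
    score = best = length * match
    best_len = length
    a, b = i + length, j + length
    while a < len(s) and b < len(t):
        score += match if s[a] == t[b] else mismatch
        if score < floor:
            break
        a += 1
        b += 1
        if score > best:
            best, best_len = score, a - i
    return best, best_len


print(extend_right("ACGTACGT", "ACGTTCGA", 0, 0, 2))
```

A symmetric leftward pass, and merging of overlapping regions on the same diagonal, would complete the step.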

17 FASTA’s Final steps: using banded DP
- List the highest-scoring diagonal matches.
- Run banded DP on regions containing a high-scoring diagonal (say, with width 12).
Hence, the algorithm may combine some diagonals into gapped matches. In the slide's example, which shows three diagonals labeled 1, 2, and 3, it could combine diagonals 2 and 3.

18 FASTA: practical choices
Some implementation choices and tricks have not been explicated here.
Most applications of FASTA use a very small ktup (1-2 for proteins, and 4-6 for DNA). Higher values yield fewer potential diagonals, so the DP search around potential diagonals is faster, but the chance of missing an optimal local alignment increases.

19 BLAST Overview
- Based on ideas similar to those described earlier (high-scoring pairs rather than exact k-tuples as seeds).
- Uses an established statistical framework to determine thresholds.
- The new PSI-BLAST (Position Specific Iterated BLAST) is the state-of-the-art sequence comparison software.
Iterative procedure:
  - Perform BLAST on a database.
  - Use significant alignments to construct a “position specific” score matrix.
  - Use this matrix in the next round of database searching, until no new significant alignments are found.
Can sometimes detect remote homologs.

20 Where do scoring rules come from?
We have defined an additive scoring function by specifying a function σ(·,·) such that
- σ(x,y) is the score of replacing x by y
- σ(x,-) is the score of deleting x
- σ(-,x) is the score of inserting x
But how do we come up with the “correct” score?
Answer: By encoding experience of what are similar sequences for the task at hand. Similarity depends on time, evolution trends, and sequence types.

21 Why use probability to define and/or interpret a scoring function?
- Similarity is probabilistic in nature because biological changes like mutation, recombination, and selection are not deterministic.
- We could answer questions such as:
  - How probable is it that two sequences are similar?
  - Is the similarity found significant or random?
  - How should a similarity score change when, say, the mutation rate of a specific area on the chromosome becomes known?

22 A Probabilistic Model
- For now, we will focus on alignment without indels.
- For now, we assume each position (nucleotide/amino acid) is independent of other positions.
- We consider two options:
  - M: the sequences are Matched (related)
  - R: the sequences are Random (unrelated)

23 Unrelated Sequences
- Our random model of unrelated sequences is simple: each position is sampled independently from a distribution q(·) over the alphabet Σ.
- Then:
  $$P(s[1..n], t[1..n] \mid R) = \prod_{i=1}^{n} q(s[i]) \, q(t[i])$$

24 Related Sequences
- We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor.
- Let p(a,b) be the probability that some ancestral letter evolved into this particular pair of letters. Then:
  $$P(s[1..n], t[1..n] \mid M) = \prod_{i=1}^{n} p(s[i], t[i])$$

25 Odds-Ratio Test for Alignment
$$Q = \frac{P(s,t \mid M)}{P(s,t \mid R)} = \prod_{i} \frac{p(s[i],t[i])}{q(s[i]) \, q(t[i])}$$
If Q > 1, then the two strings s and t are more likely to be related (M) than unrelated (R).
If Q < 1, then the two strings s and t are more likely to be unrelated (R) than related (M).

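26 Log Odds-Ratio Test for Alignment
Taking the logarithm of Q yields
$$\log Q = \log \frac{P(s,t \mid M)}{P(s,t \mid R)} = \sum_{i} \log \frac{p(s[i],t[i])}{q(s[i]) \, q(t[i])}$$
where each term of the sum is labeled Score(s[i],t[i]) on the slide.
If log Q > 0, then s and t are more likely to be related. If log Q < 0, then they are more likely to be unrelated.
How can we relate this quantity to a score function?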

27 Probabilistic Interpretation of Scores
- We define the scoring function via
  $$\sigma(a,b) = \log \frac{p(a,b)}{q(a) \, q(b)}$$
- Then, the score of an alignment is the log-ratio between the two models:
  - Score > 0 ⇒ the related model (M) is more likely
  - Score < 0 ⇒ the random model (R) is more likely
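
A small numeric sketch of this interpretation, assuming the standard log-odds form σ(a,b) = log( p(a,b) / (q(a) q(b)) ) and a made-up two-letter alphabet. The probabilities below are invented for illustration and are not estimates from real data.

```python
import math


def log_odds_scores(p, q):
    """sigma(a, b) = log( p(a, b) / (q(a) * q(b)) ) for every pair in p."""
    return {(a, b): math.log(pab / (q[a] * q[b])) for (a, b), pab in p.items()}


def alignment_score(s, t, sigma):
    """Score of an ungapped alignment: sum of per-position log-odds."""
    return sum(sigma[(a, b)] for a, b in zip(s, t))


# Toy distributions (assumed): identical letters are favored under model M.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
sigma = log_odds_scores(p, q)

print(alignment_score("AAB", "AAB", sigma) > 0)  # related model favored
print(alignment_score("AAB", "BBA", sigma) < 0)  # random model favored
```

Because the score is a sum of logs, it equals the log of the likelihood ratio between the two models for the whole alignment.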

28 Constructing Scoring Rules
The formula suggests how to construct a scoring rule:
- Estimate p(·,·) and q(·) from the data.
- Compute σ(a,b) based on the estimated p(·,·) and q(·).
- How to estimate these parameters is the subject matter of parameter estimation in statistics.

29 Modeling Assumptions
- It is important to note that this interpretation depends on our modeling assumptions!
- For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.
- Next week we will start to discuss languages that allow us to employ more complex models (FSA and Bayesian networks).

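30 Estimating q(·)
- Suppose we are given a long string s[1..n] of letters from Σ (say, the concatenation of all sequences in a database).
- We want to estimate the distribution q(·).
- Let $N_a$ be the number of times a appears in s.
- Likelihood function: $L(s \mid q) = \prod_{i=1}^{n} q(s[i]) = \prod_{a} q(a)^{N_a}$
- ML parameters: $q(a) = N_a / n$
- MAP parameters: obtained by additionally placing a prior on q(·) and maximizing the posterior.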
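
The counting estimate can be sketched as below. With `pseudocount=0` the function returns the maximum-likelihood frequencies N_a / n. The pseudocount variant is an assumed, commonly used smoothing in the spirit of a prior-based estimate; the slide's exact prior is not given, so treat it as an illustration only.

```python
from collections import Counter


def estimate_q(s, alphabet, pseudocount=0.0):
    """Estimate letter frequencies from s, which must use only letters in `alphabet`.

    pseudocount=0 gives the ML estimate N_a / n. A positive pseudocount adds
    the same constant to every count (an assumed smoothing scheme).
    """
    counts = Counter(s)
    total = len(s) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


ml = estimate_q("AACG", "ACGT")
smoothed = estimate_q("AACG", "ACGT", pseudocount=1.0)
print(ml)
print(smoothed)
```

Smoothing matters in practice because a letter with zero count would otherwise get probability zero, making log-odds scores involving it undefined.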

31 Estimating p(·,·)
Intuition:
- Find a pair of aligned sequences s[1..n], t[1..n].
- Estimate the probability of pairs: $p(a,b) = N_{a,b} / n$, where $N_{a,b}$ is the number of times a is aligned with b in (s,t).
- Again, s and t can be the concatenation of many aligned pairs from a database.
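
Counting aligned pairs is equally short. The sketch assumes two ungapped aligned strings of equal length and uses the plain relative frequency N_{a,b} / n. The function name is hypothetical, and the pairs are kept ordered, with no symmetrization applied.

```python
from collections import Counter


def estimate_p(s, t):
    """Relative frequency of each aligned pair (s[i], t[i])."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    n = len(s)
    return {pair: count / n for pair, count in Counter(zip(s, t)).items()}


p_hat = estimate_p("AACG", "AATG")
print(p_hat)
```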

32 Estimating p(·,·) for proteins
- Generate a large, diverse collection of accepted mutations. An accepted mutation is a mutation inferred from an alignment of closely related protein sequences, for example the hemoglobin alpha chain in humans and other organisms (homologous proteins).
- Let $p_a = n_a / n$, where $n_a$ is the number of occurrences of letter a and n is the total number of letters in the collection, so $n = \sum_a n_a$.
- Mutation counts: let $f_{ab}$ be the number of mutations a ↔ b, let $f_a = \sum_{b \neq a} f_{ab}$ be the total number of mutations that involve a, and let $f = \sum_a f_a$ be the total number of amino acids involved in a mutation. Note that f is twice the number of mutations.

33 PAM-1 matrices
- Define $M_{ab}$ to be the symmetric probability matrix for switching between a and b.
- We set $M_{aa} = 1 - m_a$, so that $m_a$ is the probability that a is involved in a change.
- We define $M_{ab}$ such that only 1% of amino acids change according to this matrix, or equivalently 99% do not. Hence the name, 1-Percent Accepted Mutation (PAM). In other words,
  $$\sum_a p_a M_{aa} = 0.99$$

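34 PAM-1 matrices
- We wish that $m_a$ will be proportional to the relative mutability of letter a compared to other letters:
  $$m_a = \frac{1}{K} \cdot \frac{f_a}{f \, p_a}$$
  where K is a proportionality constant, $f_a$ is the number of mutations that involve a, f is the total number of amino acids involved in a mutation, and $p_a$ is the frequency of letter a.
- We select K to satisfy the PAM-1 definition:
  $$\sum_a p_a m_a = \frac{1}{K} \sum_a \frac{f_a}{f} = \frac{1}{K} = 0.01$$
- So K=100 for PAM-1 matrices. Note that K=50 yields 2% change, etc.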
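
The construction can be tried on toy counts. The sketch below assumes the textbook-style definitions m_a = f_a / (K f p_a) for the diagonal and M_ab = f_ab / (K f p_a) for a ≠ b (following the presentation in Setubal and Meidanis); the off-diagonal form is not visible in this transcript, so treat it as an assumption. The three-letter alphabet and all counts are invented.

```python
def pam1_matrix(pair_counts, p, K=100):
    """Build a PAM-style transition matrix from unordered mutation counts.

    pair_counts: {(a, b): number of observed a<->b mutations}, each unordered
                 pair listed once (toy data in this example).
    p:           {a: frequency of letter a}.
    Assumes m_a = f_a / (K * f * p_a) and M_ab = f_ab / (K * f * p_a).
    """
    letters = sorted(p)
    f_ab = {a: {b: 0 for b in letters} for a in letters}
    for (a, b), count in pair_counts.items():
        f_ab[a][b] += count
        f_ab[b][a] += count
    f_a = {a: sum(f_ab[a].values()) for a in letters}
    f = sum(f_a.values())  # twice the number of mutations
    M = {}
    for a in letters:
        M[a] = {b: f_ab[a][b] / (K * f * p[a]) for b in letters if b != a}
        M[a][a] = 1.0 - f_a[a] / (K * f * p[a])
    return M


pair_counts = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
p = {"A": 0.5, "B": 0.3, "C": 0.2}
M = pam1_matrix(pair_counts, p)
changed = sum(p[a] * (1.0 - M[a][a]) for a in p)
print(round(changed, 6))
```

With K=100 the expected fraction of changed letters comes out as 1%, and with K=50 as 2%, which is the relation between K and the amount of change stated on the slide.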

35 Evolutionary distance
The choice that 1% of amino acids change (and that K=100) is quite arbitrary. It could fit a specific set of proteins whose evolutionary distance is such that indeed 1% of the letters have mutated. This is a unit of evolutionary change, not of time, because evolution acts differently on distinct sequence types.
What is the substitution matrix for k units of evolutionary time?

36 Model of Evolution
We make some assumptions:
1. Each position changes independently of the rest.
2. The probability of mutations is the same in each position.
3. Evolution does not “remember.”
[Figure: letters observed at times t, t+Δ, t+2Δ, t+3Δ, t+4Δ, changing among A, C, G, and T over successive steps.]

37 Model of Evolution
- How do we model such a process?
- This process is called a Markov chain.
- A chain is defined by the transition probability $P(X_{t+\Delta}=b \mid X_t=a)$: the probability that the next state is b given that the current state is a.
- We often describe these probabilities by a matrix: $M[\Delta]_{ab} = P(X_{t+\Delta}=b \mid X_t=a)$

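38 Multi-Step Changes
- Based on $M_{ab}$, we can compute the probabilities of changes over two time periods.
- Using conditional independence (no memory):
  $$P(X_{t+2\Delta}=b \mid X_t=a) = \sum_c P(X_{t+\Delta}=c \mid X_t=a) \, P(X_{t+2\Delta}=b \mid X_{t+\Delta}=c) = \sum_c M[\Delta]_{ac} \, M[\Delta]_{cb}$$
- Thus $M[2\Delta] = M[\Delta] M[\Delta]$.
- By induction (homework exercise): $M[n\Delta] = M[\Delta]^n$.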

39 Longer Term Changes
- Estimate $M[\Delta] = M$ (PAM-1 matrices).
- Use $M[n\Delta] = M^n$ (PAM-n matrices).
- Define
  $$\sigma_n(a,b) = \log \frac{p_a \, M^n_{ab}}{p_a \, p_b} = \log \frac{M^n_{ab}}{p_b}$$
- Use this quantity to define the score for your application of interest.
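
Raising a one-step matrix to the n-th power is easy to sketch without any libraries. The 2×2 matrix below is a made-up stand-in for a PAM-1 matrix, used only to show the mechanics; a real PAM matrix is 20×20.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """M raised to the n-th power by repeated squaring (n >= 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in M]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy one-step transition matrix (assumed values); each row sums to 1.
M1 = [[0.99, 0.01],
      [0.02, 0.98]]
M2 = mat_pow(M1, 2)
M250 = mat_pow(M1, 250)
print(M2)
print([sum(row) for row in M250])
```

Each power is again a transition matrix (rows sum to 1), and as n grows the rows approach the chain's stationary distribution.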

40 Comments regarding PAM
- Historically, researchers use PAM-250 (the only one published in the original paper).
- The original PAM matrices were based on a small number of proteins (circa 1978). Later versions use many more examples.
- PAM used to be the most popular scoring rule, but there are some problems with PAM matrices.

41 Degrees of freedom in PAM definition
With K=100 we obtain the 1-PAM matrix M[Δ]. With K=50 the basic matrix, M[2Δ], is different. Thus we have two different ways to estimate the matrix M[4Δ]:
- Use the 1-PAM matrix to the fourth power: M[4Δ] = M[Δ]^4
- Or use the K=50 matrix to the second power: M[4Δ] = M[2Δ]^2

42 Problems in building distance matrices
- How do we find pairs of aligned sequences?
- How far is the ancestor?
  - earlier divergence ⇒ low sequence similarity
  - later divergence ⇒ high sequence similarity
  E.g., M[250Δ] is known not to reflect long-period changes well.
- Does one letter mutate to the other, or are they both mutations of a third letter?

43 BLOSUM Outline
- Idea: use aligned ungapped regions of protein families. These are assumed to have a common ancestor. Similar ideas, but better statistics and modeling.
- Procedure:
  - Cluster together sequences in a family whenever more than L% identical residues are shared.
  - Count the number of substitutions across different clusters (in the same family).
  - Estimate frequencies using the counts.
- Practice: Blosum50 and Blosum62 are widely used (see pages 43-44 in the textbook). Considered state of the art nowadays.

