1 Pairwise sequence alignment algorithms Elya Flax & Inbar Matarasso Seminar in Structural Bioinformatics - Pairwise sequence alignment algorithms.


2 Outline
The importance of (sub)sequence comparison in molecular biology
The edit distance between two strings
Dynamic programming
String similarity
Computing alignments in linear space
Local alignment
Gaps

3 Motivation
The area of approximate matching and sequence comparison is central in computational molecular biology because of the active mutational processes that (sub)sequence comparison methods seek to model and reveal.
Much of computational biology concerns sequence alignments.

4 The importance of (sub)sequence comparison in molecular biology
The first fact of biological sequence analysis: In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.
“Redundancy” and “similarity” are central phenomena in biology. But similarity has its limits: humans differ in some respects. These differences make conserved similarity even more significant, which in turn makes comparison and analogy very powerful tools in biology.

5 The importance of (sub)sequence comparison in molecular biology
“... Similar sequences yield similar structures, but quite distinct sequences can produce remarkably similar structures.”
F. E. Cohen. Folding the sheets: using computational methods to predict structures of proteins. In E. Lander and M. S. Waterman, editors, Calculating the Secrets of Life, pages 236-71. National Academy Press, 1995.

6 Terminology
Approximate: some errors, of various types detailed later, are acceptable in valid matches.
Alignment: lining up the characters of strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces inserted into the other string.
q a c _ d b d
q a w x _ b _

7 Terminology
Subsequence versus substring: A subsequence differs from a substring in that the characters in a substring must be contiguous, whereas the characters in a subsequence embedded in a string need not be.
For example, the string xyz is a subsequence, but not a substring, of axayaz.
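The distinction can be checked mechanically. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur contiguously in text."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur in text in order, gaps allowed."""
    remaining = iter(text)
    # 'ch in remaining' advances the iterator past the first occurrence of ch,
    # so later characters of pattern can only match later positions of text.
    return all(ch in remaining for ch in pattern)


if __name__ == "__main__":
    print(is_subsequence("xyz", "axayaz"), is_substring("xyz", "axayaz"))
```

Every substring is also a subsequence, but not the other way around.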

8 Dynamic Programming
Dynamic programming is typically applied to optimization problems. The development of a dynamic programming algorithm can be broken into a sequence of four steps:
i. Characterize the structure of an optimal solution.
ii. Recursively define the value of an optimal solution.
iii. Compute the value of an optimal solution in a bottom-up fashion.
iv. Construct an optimal solution from computed information.

9 Edit Distance
Instance: two sequences x[1..m] and y[1..n], and a set of operation costs.
Problem: find the cost of the least expensive sequence of operations that transforms x into y.

10 The edit distance between two strings
The permitted edit operations are insertion (I), deletion (D), and replacement (R); a match (M) leaves the character unchanged.
Definition: A string over the alphabet I, D, R, M that describes a transformation of one string into another is called an edit transcript, or transcript for short, of the two strings.
R I M D M D M M I
v _ i n t n e r _
w r i _ t _ e r s

11 The edit distance between two strings
Definition: The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second. For emphasis, note that matches are not counted.

12 String alignment
Definition: A (global) alignment of two strings S1 and S2 is obtained by first inserting chosen spaces (or dashes), either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.

13 String alignment
Example: an alignment of the strings qacdbd and qawxb:
q a c _ d b d
q a w x _ b _

14 Alignment versus edit transcript
From the mathematical standpoint, an alignment and an edit transcript are equivalent ways to describe a relationship between two strings.
From a modeling standpoint, an edit transcript emphasizes the putative mutational events that transform one string into another, whereas an alignment only displays a relationship between the two strings.
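The equivalence can be made concrete by reading a transcript off an alignment column by column. This is a minimal sketch; the function name `alignment_to_transcript` is illustrative, and it assumes the convention that the first row is the string being transformed, so a space in the first row is an insertion (I) and a space in the second row is a deletion (D).

```python
def alignment_to_transcript(row1: str, row2: str, space: str = "_") -> str:
    """Convert an alignment (two equal-length rows) into an edit transcript.

    Assumed convention: row1 is transformed into row2.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    ops = []
    for a, b in zip(row1, row2):
        if a == space and b == space:
            raise ValueError("a column may not contain two spaces")
        if a == space:
            ops.append("I")      # character of row2 inserted into row1
        elif b == space:
            ops.append("D")      # character of row1 deleted
        elif a == b:
            ops.append("M")      # match
        else:
            ops.append("R")      # replacement (substitution)
    return "".join(ops)


if __name__ == "__main__":
    print(alignment_to_transcript("v_intner_", "wri_t_ers"))
    print(alignment_to_transcript("qac_dbd", "qawx_b_"))
```

Applied to the alignment of vintner and writers written with explicit spaces, this yields the transcript RIMDMDMMI.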

15 Dynamic programming calculation of edit distance
Definition: For two strings S1 and S2, D(i,j) is defined to be the edit distance of S1[1..i] and S2[1..j]. That is, D(i,j) denotes the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2.
With n = |S1| and m = |S2|, D(n,m) is the edit distance of S1 and S2.

16 The recurrence relation
The base conditions are: D(i,0) = i and D(0,j) = j.
The recurrence relation for D(i,j) when both i and j are strictly positive is:
D(i,j) = min[D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + t(i,j)]
where t(i,j) is defined to have value 1 if S1(i) ≠ S2(j), and 0 otherwise.

17 Correctness of the general recurrence
Lemma 1: The value of D(i,j) must be D(i-1,j) + 1, D(i,j-1) + 1, or D(i-1,j-1) + t(i,j). There are no other possibilities.
Lemma 2: D(i,j) ≤ min[D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + t(i,j)].
Theorem: When both i and j are strictly positive, D(i,j) = min[D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + t(i,j)].

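18 Tabular computation of edit distance
Goal: efficiently compute the value D(n,m).
Top-down computation: evaluating the recurrence by direct recursion is inefficient, because there are only (n+1) × (m+1) distinct combinations of i and j, yet the recursion makes many redundant recursive calls on the same pairs.
Bottom-up computation: fill in a table of D(i,j) values starting from the base conditions, so that each cell is computed once.
Time analysis: the table has O(nm) cells, each filled in constant time.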

19 Tabular computation of edit distance
[Figure: dynamic programming table; image not reproduced in the transcript.]

20 Tabular computation of edit distance
Theorem: The dynamic programming table for computing the edit distance between a string of length n and a string of length m can be filled in with O(nm) work. Hence, using dynamic programming, the edit distance D(n,m) can be computed in O(nm) time.
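The bottom-up computation can be sketched as follows. This is an illustrative implementation of the recurrence for D(i,j) with unit costs, not code from the seminar; the names `edit_distance_table` and `edit_distance` are assumptions.

```python
def edit_distance_table(s1: str, s2: str) -> list[list[int]]:
    """Fill the (n+1) x (m+1) table D bottom-up; D[i][j] = D(i,j)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # base condition D(i,0) = i
    for j in range(1, m + 1):
        D[0][j] = j                      # base condition D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1   # t(i,j)
            D[i][j] = min(D[i - 1][j] + 1,           # delete S1(i)
                          D[i][j - 1] + 1,           # insert S2(j)
                          D[i - 1][j - 1] + t)       # match or substitute
    return D


def edit_distance(s1: str, s2: str) -> int:
    """Edit distance D(n,m), computed in O(nm) time."""
    return edit_distance_table(s1, s2)[len(s1)][len(s2)]


if __name__ == "__main__":
    print(edit_distance("vintner", "writers"))
```

Each of the (n+1)(m+1) cells is filled once from three neighbours, which is where the O(nm) bound comes from.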

21 The traceback
When the value of cell (i,j) is computed, set pointers according to the following rules:
If D(i,j) = D(i,j-1) + 1, set a pointer (i,j) → (i,j-1).
If D(i,j) = D(i-1,j) + 1, set a pointer (i,j) → (i-1,j).
If D(i,j) = D(i-1,j-1) + t(i,j), set a pointer (i,j) → (i-1,j-1).
To obtain an optimal edit transcript, follow any path of pointers from cell (n,m) to cell (0,0).

22 The traceback
[Figure: dynamic programming table with traceback pointers; image not reproduced in the transcript.]

23 The traceback
A horizontal edge, from (i,j) to (i,j-1), denotes an insertion of S2(j).
A vertical edge, from (i,j) to (i-1,j), denotes a deletion of S1(i).
A diagonal edge, from (i,j) to (i-1,j-1), denotes a substitution if S1(i) ≠ S2(j), and a match otherwise.

24 The traceback
Theorem: Once the dynamic programming table with pointers has been computed, an optimal edit transcript can be found in O(n+m) time.

25 The traceback
Theorem: Any path from (n,m) to (0,0) following pointers established during the computation of D(i,j) specifies an edit transcript with the minimum number of edit operations. Conversely, any optimal edit transcript is specified by such a path. Moreover, since a path describes only one transcript, the correspondence between paths and optimal transcripts is one-to-one.
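The traceback can be sketched as follows. Instead of storing pointers, this illustrative version re-derives them from the table values, which is equivalent; it returns one optimal transcript, and the preference for diagonal edges on ties is an arbitrary choice made here, not something the slides prescribe. The function names are assumptions.

```python
def edit_table(s1: str, s2: str) -> list[list[int]]:
    """Unit-cost edit distance table, D[i][j] = D(i,j)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def edit_transcript(s1: str, s2: str) -> str:
    """Follow one pointer path from (n,m) to (0,0); O(n+m) steps."""
    D = edit_table(s1, s2)
    i, j = len(s1), len(s2)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:       # diagonal edge
                ops.append("M" if t == 0 else "R")
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:     # vertical edge
            ops.append("D")
            i -= 1
        else:                                        # horizontal edge
            ops.append("I")
            j -= 1
    return "".join(reversed(ops))


if __name__ == "__main__":
    print(edit_transcript("vintner", "writers"))
```

Other tie-breaking orders produce other optimal transcripts; all of them contain the same number of I, D, and R operations in total.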

26 Edit graphs
Definition: Given two strings S1 and S2 of lengths n and m, respectively, a weighted edit graph has (n+1) × (m+1) nodes, each labeled with a distinct pair (i,j) (0 ≤ i ≤ n, 0 ≤ j ≤ m). The specific edges and their edge weights depend on the specific string problem.
For the edit distance problem:
The weight on the edges (i,j) → (i,j+1) and (i,j) → (i+1,j) is 1.
The weight on the edges (i,j) → (i+1,j+1) is t(i+1,j+1).
[Figure: edit graph for the strings CAN (rows 0 to 3) and ANN (columns 0 to 3).]

27 Weighted edit distance
Definition: With arbitrary operation weights, the operation-weight edit distance problem is to find an edit transcript that transforms string S1 into S2 with the minimum total operation weight.
For example: if each mismatch has a weight of 2, each space has a weight of 4, and each match a weight of 1, then the following alignment has a total weight of 2 + 2 + 2 + 1 + 4 + 1 + 1 + 4 = 17.
w r i t _ e r s
v i n t n e r _
Note that under these particular weights the alignment with no spaces (six mismatches and one match) weighs only 13, so the alignment shown is not the minimum-weight one.
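The weight of a given alignment, and the minimum weight over all alignments, can both be computed directly. This is a minimal sketch under the weights of the example (match 1, mismatch 2, space 4); the function names are illustrative, and the second function simply generalizes the unit-cost recurrence for D(i,j) to these weights.

```python
def alignment_weight(row1: str, row2: str,
                     match: int = 1, mismatch: int = 2, space: int = 4) -> int:
    """Total operation weight of an alignment given as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "_" or b == "_":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def min_operation_weight(s1: str, s2: str,
                         match: int = 1, mismatch: int = 2, space: int = 4) -> int:
    """Minimum total weight over all alignments of s1 and s2, in O(nm) time."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * space
    for j in range(1, m + 1):
        D[0][j] = j * space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = match if s1[i - 1] == s2[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j] + space,
                          D[i][j - 1] + space,
                          D[i - 1][j - 1] + diag)
    return D[n][m]


if __name__ == "__main__":
    print(alignment_weight("writ_ers", "vintner_"))
    print(min_operation_weight("writers", "vintner"))
```

Because matches carry a positive weight here, the minimum is sensitive to the relative sizes of the three weights, so it is worth recomputing whenever they change.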

28 Alphabet-weight edit distance
The weight of a substitution depends on exactly which character in the alphabet is being removed and which is being added.

29 String similarity
String similarity is a way of formalizing the relatedness of two strings by measuring their similarity rather than their distance.
Definition: Let Σ be the alphabet used for strings S1 and S2, and let Σ' be Σ with the added character “_”. Then, for any two characters x, y in Σ', s(x,y) denotes the value (or score) obtained by aligning character x against character y.

30 String similarity
Definition: For a given alignment A of S1 and S2, let S1' and S2' denote the strings after the chosen insertion of spaces, and let l denote the (equal) length of the two strings in A. The value of alignment A is defined as
$\sum_{i=1}^{l} s(S_1'(i), S_2'(i))$.

31 String similarity
For example, let Σ = {a, b, c, d} and let the pairwise scores be defined in a matrix s over a, b, c, d, _ (entries legible in the transcript, row by row: a: 1, -2, 0; b: 3, -2, 0; c: 0, -4; d: 3; _: 0).
Then the alignment
c a c _ d b d
c a b b d b _
has a total value of 0 + 1 - 2 + 0 + 3 + 3 - 1 = 4, one term per column.

32 String similarity
Definition: Given a pairwise scoring matrix over the alphabet Σ', the similarity of two strings S1 and S2 is defined as the value of the alignment A of S1 and S2 that maximizes total alignment value.

33 Computing similarity
Definition: V(i,j) is defined as the value of the optimal alignment of prefixes S1[1..i] and S2[1..j].
The base conditions are
$V(0,j) = \sum_{1 \le k \le j} s(\_, S_2(k))$
$V(i,0) = \sum_{1 \le k \le i} s(S_1(k), \_)$

34 Computing similarity
For i and j both strictly positive, the general recurrence is
V(i,j) = max[ V(i-1,j-1) + s(S1(i), S2(j)), V(i-1,j) + s(S1(i), _), V(i,j-1) + s(_, S2(j)) ]
If S1 and S2 are of lengths n and m, then the value of their optimal alignment, V(n,m), can be found (using a dynamic programming table) in O(nm) time.
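The similarity recurrence translates directly into a table fill. This is a minimal sketch; `similarity` and `simple_score` are illustrative names, and the scoring function (match +2, mismatch -2, space -1) is an assumption borrowed from the local alignment example later in the talk, not a prescribed matrix.

```python
from typing import Callable

Score = Callable[[str, str], int]


def simple_score(x: str, y: str) -> int:
    """Assumed scoring: +2 match, -2 mismatch, -1 for a character against a space."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


def similarity(s1: str, s2: str, score: Score = simple_score) -> int:
    """Value V(n,m) of an optimal global alignment, in O(nm) time."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                       # V(i,0): S1 prefix against spaces
        V[i][0] = V[i - 1][0] + score(s1[i - 1], "_")
    for j in range(1, m + 1):                       # V(0,j): spaces against S2 prefix
        V[0][j] = V[0][j - 1] + score("_", s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                          V[i - 1][j] + score(s1[i - 1], "_"),
                          V[i][j - 1] + score("_", s2[j - 1]))
    return V[n][m]


if __name__ == "__main__":
    print(similarity("axabcs", "axbacs"))
```

Unlike edit distance, the recurrence maximizes, and the base conditions accumulate the scores of characters aligned against spaces.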

35 Alignment graphs for similarity
As was the case for edit distance, the computation of similarity can be viewed as a path problem on a directed acyclic graph called an alignment graph.
The longest start-to-destination paths in the alignment graph are in one-to-one correspondence with the optimal (maximum value) alignments.

36 End-space free variant
Spaces at the end or the beginning of the alignment contribute a weight of zero.
Example: the shotgun sequence assembly problem.
Implementation: use the recurrence for global alignment, but change the base conditions to V(i,0) = V(0,j) = 0. This makes leading spaces free; to make trailing spaces free as well, take the optimal value as the maximum over the cells in row n and column m rather than V(n,m).

37 Approximate occurrences of P in T
Definition: Given a parameter δ, a substring T' of T is said to be an approximate occurrence of P if and only if the optimal alignment of P to T' has value at least δ.
Theorem: Let n be the length of P, and let V be computed with the base condition V(0,j) = 0 for all j. There is an approximate occurrence of P in T ending at position j of T if and only if V(n,j) ≥ δ. Moreover, T[k..j] is an approximate occurrence of P in T if and only if V(n,j) ≥ δ and there is a path of backpointers from cell (n,j) to cell (0,k).
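The first half of the theorem can be sketched as follows: fill the table for P against T with V(0,j) = 0 (an occurrence may start anywhere in T) and report every j with V(n,j) ≥ δ. The function names and the scoring function (match +2, mismatch -2, space -1) are assumptions made for illustration; only the end positions are reported, not the start positions k.

```python
from typing import Callable


def simple_score(x: str, y: str) -> int:
    """Assumed scoring: +2 match, -2 mismatch, -1 for a character against a space."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


def approximate_occurrence_ends(p: str, t: str, delta: int,
                                score: Callable[[str, str], int] = simple_score
                                ) -> list[int]:
    """1-based positions j of T with V(n,j) >= delta, using O(|T|) space."""
    m = len(t)
    prev = [0] * (m + 1)                  # row 0: V(0,j) = 0 for all j
    for ch in p:
        cur = [prev[0] + score(ch, "_")] + [0] * m   # V(i,0): P prefix vs spaces
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + score(ch, t[j - 1]),
                         prev[j] + score(ch, "_"),
                         cur[j - 1] + score("_", t[j - 1]))
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] >= delta]


if __name__ == "__main__":
    print(approximate_occurrence_ends("abc", "xxabcxx", 6))
    print(approximate_occurrence_ends("abc", "xxabcxx", 5))
```

Recovering the start position k of each occurrence would additionally require the backpointers described in the theorem, which this two-row version does not keep.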

38 How to find the optimal alignment in linear space?
Definition: For any string α, let α^r denote the reverse of string α.
Definition: Given strings S1 and S2, define V^r(i,j) as the similarity of the string consisting of the first i characters of S1^r and the string consisting of the first j characters of S2^r. Equivalently, V^r(i,j) is the similarity of the last i characters of S1 and the last j characters of S2.

39 How to find the optimal alignment in linear space?
Lemma 1: $V(n,m) = \max_{0 \le k \le m} [V(n/2,k) + V^r(n/2,m-k)]$.
Definition: Let k* be a position k that maximizes [V(n/2,k) + V^r(n/2,m-k)].
Definition: Let L be an optimal alignment path passing through cell (n/2,k*), and let L_{n/2} be the subpath of L that starts with the last node of L in row n/2-1 and ends with the first node of L in row n/2+1.

40 How to find the optimal alignment in linear space?
Lemma 2: A position k* in row n/2 can be found in O(nm) time and O(m) space. Moreover, a subpath L_{n/2} can be found and stored in those time and space bounds.

41 How to find the optimal alignment in linear space?
[Figure: the dynamic programming table with rows n/2-1, n/2, n/2+1, and n marked, columns k1, k*, k2, and m marked, and the two remaining subproblems labeled A and B.]

42 How to find the optimal alignment in linear space?
Execute dynamic programming to compute the optimal alignment of S1 and S2, but stop after iteration n/2, that is, once row n/2 has been filled in. When filling in row n/2, save the normal traceback pointers for the cells in that row. This takes O(m) space.
Do the same first steps for S1^r and S2^r.
Using the first set of saved pointers, follow any traceback path from cell (n/2,k*) to a cell k1 in row n/2-1. (Do the same for k2 and row n/2+1, using the second set of pointers.)
In total, O(nm) time and O(m) space are used to find k*, k1, k2, and L_{n/2}.
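The core of the method, finding k* in O(m) space, can be sketched as follows. This illustrative code covers only the computation of k* and the check of Lemma 1; it does not recover k1, k2, or the subpath L_{n/2}, and it does not perform the recursion on the two remaining subproblems. The function names and the scoring function are assumptions; for odd n the second half of S1 has n - floor(n/2) characters.

```python
from typing import Callable

Score = Callable[[str, str], int]


def simple_score(x: str, y: str) -> int:
    """Assumed scoring: +2 match, -2 mismatch, -1 for a character against a space."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


def last_row(s1: str, s2: str, score: Score = simple_score) -> list[int]:
    """Return [V(len(s1), k) for k = 0..len(s2)], keeping only two rows."""
    m = len(s2)
    prev = [0] * (m + 1)
    for j in range(1, m + 1):
        prev[j] = prev[j - 1] + score("_", s2[j - 1])
    for ch in s1:
        cur = [prev[0] + score(ch, "_")] + [0] * m
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + score(ch, s2[j - 1]),
                         prev[j] + score(ch, "_"),
                         cur[j - 1] + score("_", s2[j - 1]))
        prev = cur
    return prev


def find_k_star(s1: str, s2: str, score: Score = simple_score) -> tuple[int, int]:
    """Return (k*, V(n,m)) using O(m) space, following Lemma 1."""
    m = len(s2)
    half = len(s1) // 2
    fwd = last_row(s1[:half], s2, score)                  # V(n/2, k)
    rev = last_row(s1[half:][::-1], s2[::-1], score)      # V^r(., m-k) on reversed strings
    k_star = max(range(m + 1), key=lambda k: fwd[k] + rev[m - k])
    return k_star, fwd[k_star] + rev[m - k_star]


if __name__ == "__main__":
    print(find_k_star("axabcs", "axbacs"))
```

The returned value equals V(n,m) computed by a full table fill, which is exactly what Lemma 1 asserts.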

43 Local alignment
Local alignment problem: Given two strings S1 and S2, find substrings α and β of S1 and S2, respectively, whose similarity (optimal global alignment value) is maximum over all pairs of substrings from S1 and S2.

44 Local alignment
S1 = pqraxabcstvq
S2 = xyaxbacsll
match = +2, mismatch = -2, space = -1
Optimal local alignment:
a x a b _ c s
a x _ b a c s
The optimal local alignment of S1 and S2 has value 8 and is defined by the substrings axabcs and axbacs.

45 Why local alignment?
Global alignment of protein sequences is often meaningful when the two strings are members of the same protein family.
Local alignment is critical when comparing long stretches of anonymous DNA or proteins from very different families.

46 Computing local alignment
Definition: Given a pair of indices i ≤ n and j ≤ m, the local suffix alignment problem is to find a (possibly empty) suffix α of S1[1..i] and a (possibly empty) suffix β of S2[1..j] such that V(α,β) is the maximum over all pairs of suffixes of S1[1..i] and S2[1..j].

47 Computing local alignment
Theorem: Let V(i,j) be the value of the optimal local suffix alignment for the given index pair i, j, and let v* be the value of the optimal local alignment for two strings of lengths n and m. Then v* = max[V(i,j) : i ≤ n, j ≤ m].

48 Computing local alignment
Theorem: If i', j' is an index pair maximizing V(i,j) over all i, j pairs, then a pair of substrings solving the local suffix alignment problem for i', j' also solves the local alignment problem.

49 How to solve the local suffix alignment problem
First, V(i,0) = V(0,j) = 0 for all i, j, since we can always choose an empty suffix.
Theorem: For i > 0 and j > 0, the proper recurrence for V(i,j) is
V(i,j) = max[ 0, V(i-1,j-1) + s(S1(i), S2(j)), V(i-1,j) + s(S1(i), _), V(i,j-1) + s(_, S2(j)) ]
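The local suffix recurrence, together with v* = max V(i,j), can be sketched as follows. The function name `local_alignment_value` is illustrative, and the scoring function uses the weights of the local alignment example (match +2, mismatch -2, space -1). Only the value and the cell where it is attained are returned; recovering the substrings would need the traceback pointers.

```python
from typing import Callable


def simple_score(x: str, y: str) -> int:
    """Scoring from the example: +2 match, -2 mismatch, -1 against a space."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


def local_alignment_value(s1: str, s2: str,
                          score: Callable[[str, str], int] = simple_score
                          ) -> tuple[int, tuple[int, int]]:
    """Return (v*, (i', j')) where V(i', j') = v*, in O(nm) time."""
    m = len(s2)
    prev = [0] * (m + 1)                 # V(0,j) = 0
    best, best_cell = 0, (0, 0)
    for i, ch in enumerate(s1, start=1):
        cur = [0] * (m + 1)              # V(i,0) = 0
        for j in range(1, m + 1):
            cur[j] = max(0,              # empty suffixes are always allowed
                         prev[j - 1] + score(ch, s2[j - 1]),
                         prev[j] + score(ch, "_"),
                         cur[j - 1] + score("_", s2[j - 1]))
            if cur[j] > best:
                best, best_cell = cur[j], (i, j)
        prev = cur
    return best, best_cell


if __name__ == "__main__":
    print(local_alignment_value("pqraxabcstvq", "xyaxbacsll"))
```

The only differences from the global recurrence are the zero base conditions, the extra 0 inside the max, and taking the maximum over all cells instead of reading V(n,m).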

50 Time analysis
Theorem: For two strings S1 and S2 of lengths n and m, the local alignment problem can be solved in O(nm) time, the same time as for global alignment.
Theorem: All optimal local alignments of two strings are represented in the dynamic programming table for V(i,j) and can be found by tracing pointers back from any cell with value v*.

51 Gaps
Definition: A gap is any maximal, consecutive run of spaces in a single string of a given alignment.
An alignment with seven spaces distributed into four gaps:
c t t t a a c _ _ a _ a c
c _ _ _ c a c c c a t _ c
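Counting spaces and gaps is a one-liner per row once a gap is read as a maximal run of the space character. A minimal sketch, with an illustrative function name:

```python
import re


def count_spaces_and_gaps(rows: list[str], space: str = "_") -> tuple[int, int]:
    """Return (number of spaces, number of gaps) over all rows of an alignment."""
    run = re.compile(re.escape(space) + "+")     # one maximal run of spaces = one gap
    spaces = sum(row.count(space) for row in rows)
    gaps = sum(len(run.findall(row)) for row in rows)
    return spaces, gaps


if __name__ == "__main__":
    print(count_spaces_and_gaps(["ctttaac__a_ac", "c___cacccat_c"]))
```

For the alignment above this counts 3 spaces in 2 gaps in the first row and 4 spaces in 2 gaps in the second.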

52 Why gaps?
A gap in string S1 opposite substring α in string S2 corresponds to either a deletion of α from S1 or an insertion of α into S2.
The concept of a gap in an alignment is therefore important in many biological applications, because the insertion or deletion of an entire substring (particularly in DNA) often occurs as a single mutational event.

53 Choices for gap weights
We will examine in detail four general types of gap weights: constant, affine, convex, and arbitrary.
The objective in the constant gap weight model is to find an alignment A that maximizes
$\sum_{i=1}^{l} s(S_1'(i), S_2'(i)) - W_g \cdot (\text{number of gaps})$
The objective in the affine gap weight model is to find an alignment A that maximizes
$\sum_{i=1}^{l} s(S_1'(i), S_2'(i)) - W_g \cdot (\text{number of gaps}) - W_s \cdot (\text{number of spaces})$
Here W_g is the weight given to each gap and W_s is the weight given to each space.

54 Choices for gap weights
In the convex gap weight model, each additional space in a gap contributes less to the gap weight than the preceding space, so the gap weight is a convex function of its length. Example: W_g + log_e q, where q is the length of the gap.
In the arbitrary gap weight model, the weight of a gap is an arbitrary function w(q) of its length q.
The constant, affine, and convex weight models are of course subcases of the arbitrary weight model.
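Written as functions w(q) of the gap length q ≥ 1, the three special models look as follows. This is a small illustrative sketch; the function and parameter names are assumptions, with `wg` standing for W_g and `ws` for W_s.

```python
import math


def constant_gap(q: int, wg: float) -> float:
    """Constant model: every gap costs wg regardless of its length q >= 1."""
    return wg


def affine_gap(q: int, wg: float, ws: float) -> float:
    """Affine model: wg to open the gap plus ws per space."""
    return wg + q * ws


def convex_gap(q: int, wg: float) -> float:
    """Example convex model from the text: wg + log_e(q)."""
    return wg + math.log(q)


if __name__ == "__main__":
    for q in (1, 2, 3, 4):
        print(q, constant_gap(q, 2), affine_gap(q, 2, 1), round(convex_gap(q, 2), 3))
```

Any of these can be passed as the function w(q) of the arbitrary gap weight model, which is what makes them subcases of it.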

55 Time bounds for gap choices
Time bounds for solving the above problems using dynamic programming:
Arbitrary gap weights: O(nm² + n²m)
Convex gap weights: O(nm log m)
Affine gap weights: O(nm)
Constant gap weights: O(nm)

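56 Arbitrary gap weights
[Figure: the three types of alignment of S1[1..i] with S2[1..j], labeled 1, 2, and 3 and valued by E, F, and G respectively.]
In type 1 the alignment ends with a gap in S1 (S2(j) is opposite a space), in type 2 it ends with a gap in S2 (S1(i) is opposite a space), and in type 3 S1(i) is aligned opposite S2(j).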

57 Arbitrary gap weights
Definition: Define E(i,j) as the maximum value of any alignment of type 1; define F(i,j) as the maximum value of any alignment of type 2; define G(i,j) as the maximum value of any alignment of type 3; and finally define V(i,j) as the maximum of the three terms E(i,j), F(i,j), G(i,j).

58 Arbitrary gap weights
Recurrences for the case of arbitrary gap weights:
$V(i,j) = \max[E(i,j), F(i,j), G(i,j)]$
$G(i,j) = V(i-1,j-1) + s(S_1(i), S_2(j))$
$E(i,j) = \max_{0 \le k \le j-1} [V(i,k) - w(j-k)]$
$F(i,j) = \max_{0 \le l \le i-1} [V(l,j) - w(i-l)]$

59 Arbitrary gap weights
Base case if all spaces are included in the objective function:
V(i,0) = -w(i)
V(0,j) = -w(j)
E(i,0) = -w(i)
F(0,j) = -w(j)
G(0,0) = 0
Base case if end spaces, and hence end gaps, are free:
V(i,0) = 0
V(0,j) = 0

60 Time analysis
Theorem: Assuming that |S1| = n and |S2| = m, the recurrences can be evaluated in O(nm² + n²m) time.
Before gaps were included in the model, V(i,j) depended only on the three cells adjacent to (i,j). Now we need to look at j cells to the left and i cells above to determine V(i,j).
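The recurrences for E, F, G, and V can be evaluated directly, which makes the O(nm² + n²m) cost visible as the two inner maximizations. This is a minimal sketch assuming the base case in which all spaces count; the function names, the match/mismatch scores (+2, -2), and the example gap functions are illustrative assumptions.

```python
from typing import Callable


def match_score(x: str, y: str) -> int:
    """Assumed character scores: +2 match, -2 mismatch (spaces are scored by w)."""
    return 2 if x == y else -2


def similarity_arbitrary_gaps(s1: str, s2: str,
                              w: Callable[[int], float],
                              score: Callable[[str, str], int] = match_score
                              ) -> float:
    """V(n,m) under an arbitrary gap weight function w(q), q >= 1."""
    n, m = len(s1), len(s2)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)                  # S1[1..i] opposite one gap
    for j in range(1, m + 1):
        V[0][j] = -w(j)                  # S2[1..j] opposite one gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            E = max(V[i][k] - w(j - k) for k in range(j))    # look left: j cells
            F = max(V[l][j] - w(i - l) for l in range(i))    # look up: i cells
            V[i][j] = max(E, F, G)
    return V[n][m]


if __name__ == "__main__":
    # w(q) = q charges 1 per space, which reproduces the plain similarity model.
    print(similarity_arbitrary_gaps("axabcs", "axbacs", lambda q: q))
    # w(q) = 1 is a constant gap weight: spaces are free, each gap costs 1.
    print(similarity_arbitrary_gaps("aaa", "a", lambda q: 1))
```

Each cell now costs O(i + j) instead of O(1), so summing over the table gives the stated bound.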

61 Summary
The first fact of biological sequence analysis
Dynamic programming: edit distance, the recurrence relation, tabular computation
Optimal alignment in linear space
Global alignment versus local alignment
Gaps

62 Food for thought…
Repeated substrings: find inexact repeats in a single string.
If we do local alignment of a string against itself, the best substring will be the entire string.
Even using all the values in the table, the best path may be strongly influenced by the main diagonal.

63 Bibliography
Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1997.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest. Introduction to Algorithms, 2nd edition. Cambridge, MA: MIT Press, 2001. (The MIT Electrical Engineering and Computer Science Series.)
