Sequence Alignment Tutorial #3 © Ydo Wexler & Dan Geiger

Sequence Alignment (Reminder)
Global Alignment:
Input: two sequences S1, S2 over the same alphabet.
Output: two sequences S’1, S’2 of equal length (S’1, S’2 are S1, S2 with possibly additional gaps).
Example:
    S1 = GCGCATGGATTGAGCGA
    S2 = TGCGCCATTGATGACC
A possible alignment:
    S’1 = -GCGC-ATGGATTGAGCGA
    S’2 = TGCGCCATTGAT-GACC--
Goal: measure how similar the two sequences S1 and S2 are.

Sequence Alignment (Reminder)
Local Alignment:
Input: two sequences S1, S2 over the same alphabet.
Output: two sequences S’1, S’2 of equal length (S’1, S’2 are substrings of S1, S2 with possibly additional gaps).
Example:
    S1 = GCGCATGGATTGAGCGA
    S2 = TGCGCCATTGATGACC
A possible alignment:
    S’1 = ATTGA-G
    S’2 = ATTGATG
Goal: find the pair of substrings of the two input sequences that have the highest similarity.

    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
Three elements:
• Perfect matches
• Mismatches
• Insertions & deletions (indels)
Scoring:
• Score each position independently.
• The score of an alignment is the sum of the position scores.

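The position-by-position scoring described above can be sketched in Python. The function name `alignment_score` and the default values (match +1, mismatch −1, indel −2) are assumptions chosen for illustration; the slide does not fix a scoring scheme here.

```python
def alignment_score(a, b, match=1, mismatch=-1, indel=-2):
    """Sum independent per-column scores of a pairwise alignment.

    a, b: gapped sequences of equal length, with '-' marking a gap.
    The default scores are illustrative assumptions, not values from the slides.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == '-' or y == '-':
            total += indel      # insertion or deletion
        elif x == y:
            total += match      # perfect match
        else:
            total += mismatch   # substitution
    return total


# The alignment shown on the slide has 13 matches, 2 mismatches and 4 indels.
a = "-GCGC-ATGGATTGAGCGA"
b = "TGCGCCATTGAT-GACC-A"
print(alignment_score(a, b))
```

Setting two of the three parameters to 0 and the third to 1 turns the same function into a counter for matches, mismatches or indels.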
Breaking Number
Input: two sequences M, E over the same alphabet (|M| ≥ |E|).
Output: the smallest k such that there exist partitions M = M1 M2 … Mk and E = E1 E2 … Ek in which Ei is a substring of Mi for all i = 1..k. If no such k exists, return ∞.
Example:
    M = AAAATTTAAATTTA
    E = AATTATA
    M1 = AAAATTT   M2 = AAATTT   M3 = A
    E1 = AATT      E2 = AT       E3 = A

    AAAATTTAAATTTA
    --AATT---AT--A
Task: find an O(|M||E|) algorithm for computing the breaking number of M, E.

Breaking Number (cont)
Solution: reduce the problem to global alignment with the following modifications:
• Do not allow mismatches.
• Do not allow gaps in M.
• No penalty for gaps at the start or end of a sequence.
• Constant penalty for each gap, regardless of its length.
Scoring scheme (affine gap penalty):
    Match                    0
    Mismatch                −∞
    Gap introduction (d)    −1
    Gap elongation (e)       0
Example:
    AAAATTTAAATTTA
    --AATT---AT--A
breaking number = −(score of the alignment) + 1
In the example, the two internal gaps give a score of −2, so the breaking number is 3.

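A minimal Python sketch of an O(|M||E|) dynamic program for the breaking number. Instead of tracking alignment scores it counts blocks directly, which is the same quantity as −score + 1 under the scoring scheme above. The function name `breaking_number` and the two rolling arrays are illustrative choices, not part of the tutorial.

```python
import math


def breaking_number(m, e):
    """Fewest blocks k such that E = E1..Ek, M = M1..Mk and Ei is a substring of Mi.

    Returns math.inf when E cannot be embedded in M at all.
    """
    inf = math.inf
    n_e = len(e)
    # prev_run[j]:  fewest blocks covering e[:j] with e[j-1] matched to the
    #               last character of the prefix of m processed so far.
    # prev_best[j]: fewest blocks covering e[:j] anywhere in that prefix.
    prev_run = [inf] * (n_e + 1)
    prev_best = [0] + [inf] * n_e
    for ch in m:
        run = [inf] * (n_e + 1)
        best = prev_best[:]              # leaving ch unmatched (a gap in E)
        for j in range(1, n_e + 1):
            if ch == e[j - 1]:
                # Either extend the current block, or open a new block after a gap.
                run[j] = min(prev_run[j - 1], prev_best[j - 1] + 1)
                best[j] = min(best[j], run[j])
        prev_run, prev_best = run, best
    return prev_best[n_e]


print(breaking_number("AAAATTTAAATTTA", "AATTATA"))  # the slide's example
```

Gaps before the first block and after the last one are free because `prev_best[0]` starts at 0 and `best` carries completed results forward unchanged.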
Breaking Number (cont)
Complexity: standard O(|M||E|) dynamic programming.
Correctness: two-way argument.
1. An alignment of score −(k−1) corresponds to a partition of M, E into k blocks.
2. A partition of M, E into k blocks has an alignment of score −(k−1).
If the optimal alignment has score −∞:
• There is no valid partition (by 2).
If the optimal alignment has score −k:
• There is a valid partition into k+1 blocks (by 1).
• There is no valid partition into fewer blocks (by 2).

Multiple Sequence Alignment
    S1 = AGGTC
    S2 = GTTCG
    S3 = TGAAC
Possible alignments:
    AGGT-C-
    -G-TTCG
    TG-AAC-

    AGGT-C-
    GTT--CG
    -TGAAAC

Multiple Sequence Alignment (cont)
Input: sequences S1, S2, …, Sk over the same alphabet.
Output: gapped sequences S’1, S’2, …, S’k of equal length, such that:
1. |S’1| = |S’2| = … = |S’k|
2. Removing the spaces from S’i yields Si.
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

Multiple Sequence Alignment Example
Consider the following alignment:
    AC-CDB-
    -C-ADBD
    A-BCDAD
Scoring scheme: match 0, mismatch/indel −1 (a column in which both sequences have a space scores 0).
SP score: −3 − 4 − 5 = −12 (pairs (1,2), (1,3), (2,3))

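The SP score of this example can be checked with a short Python sketch. The function name `sp_score` is an illustrative choice; the default scores are the ones given on the slide, and a column where both rows hold a space is taken to score 0.

```python
from itertools import combinations


def sp_score(rows, match=0, mismatch=-1, indel=-1):
    """Sum-of-pairs score: add up the induced pairwise scores of all row pairs."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for a, b in combinations(rows, 2):
        for x, y in zip(a, b):
            if x == '-' and y == '-':
                continue            # space against space contributes nothing
            if x == '-' or y == '-':
                total += indel
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


print(sp_score(["AC-CDB-", "-C-ADBD", "A-BCDAD"]))  # → -12
```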
Multiple Sequence Alignment Complexity
Given k strings of length n, there is a generalization of the DP algorithm that finds an optimal SP alignment:
• Instead of a 2-dimensional table we have a k-dimensional table.
• Each dimension is of length n+1.
• Each entry depends on 2^k − 1 adjacent entries.
Complexity: O(2^k n^k)
This problem is known to be NP-hard (no polynomial-time algorithm is known).

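To see where the 2^k − 1 factor comes from: in every column of the alignment, each of the k sequences either advances by one character or takes a space, and the all-spaces column is excluded. A small Python sketch (the function name `predecessors` is hypothetical) lists the table cells that a given cell depends on:

```python
from itertools import product


def predecessors(cell):
    """Cells of the k-dimensional DP table that `cell` depends on.

    cell: tuple of k indices, one per sequence.
    """
    k = len(cell)
    result = []
    for step in product((0, 1), repeat=k):
        if not any(step):
            continue                      # skip the all-spaces column
        prev = tuple(c - s for c, s in zip(cell, step))
        if min(prev) >= 0:                # stay inside the table
            result.append(prev)
    return result


print(len(predecessors((3, 3, 3))))  # 2**3 - 1 = 7 for an interior cell
```

Cells on the boundary of the table have fewer predecessors, since some moves would leave the table.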
Multiple Sequence Alignment Approximation Algorithm
We use cost instead of score and look for an alignment of minimal cost.
Assumption: the cost function δ is a distance function:
• δ(x,x) = 0
• δ(x,y) = δ(y,x) ≥ 0
• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality; e.g. the cost of a mismatch is at most the cost of two indels)
D(S,T): the cost of a minimum-cost global alignment between S and T.

Multiple Sequence Alignment Approximation Algorithm
The ‘star’ algorithm:
Input: Γ, a set of k strings S1, …, Sk.
0. For each i < j calculate D(Si, Sj).
1. Find the string S’ (the center) that minimizes $\sum_{S \in \Gamma} D(S', S)$.
2. Denote S1 = S’ and the rest of the strings as S2, …, Sk.
3. Iteratively add S2, …, Sk to the alignment as follows:
   a. Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1.
   b. Align Si to S’1 to produce S’i and S’’1 aligned.
   c. Adjust S’2, …, S’i-1 by adding spaces in the columns where spaces were added to S’1 to obtain S’’1.
   d. Replace S’1 by S’’1.

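The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tutorial's implementation: the cost `delta` is assumed to be the unit cost (0 for equal symbols, including two spaces, and 1 otherwise), the names `delta`, `align` and `star_alignment` are hypothetical, and step 0 is done for all ordered pairs for simplicity.

```python
def delta(x, y):
    """Assumed unit cost: 0 for equal symbols (including two spaces), else 1."""
    return 0 if x == y else 1


def align(a, b):
    """Minimum-cost global alignment of a and b (a may already contain '-').

    Returns (cost, columns); each column is a pair
    (index into a or None, index into b or None).
    """
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(a[i - 1], '-')
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta('-', b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),
                          D[i - 1][j] + delta(a[i - 1], '-'),
                          D[i][j - 1] + delta('-', b[j - 1]))
    cols, i, j = [], n, m
    while i > 0 or j > 0:                 # trace back one optimal path
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(a[i - 1], b[j - 1]):
            cols.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(a[i - 1], '-'):
            cols.append((i - 1, None)); i -= 1
        else:
            cols.append((None, j - 1)); j -= 1
    cols.reverse()
    return D[n][m], cols


def star_alignment(seqs):
    """Center-star alignment; returns the gapped center followed by the other rows."""
    k = len(seqs)
    dist = [[align(s, t)[0] for t in seqs] for s in seqs]      # step 0
    c = min(range(k), key=lambda i: sum(dist[i]))              # step 1
    center, rows = seqs[c], []

    def pick(text, idx):
        return '-' if idx is None else text[idx]

    for i, s in enumerate(seqs):                               # step 3
        if i == c:
            continue
        _, cols = align(center, s)                             # step 3b
        rows = [''.join(pick(r, ai) for ai, _ in cols) for r in rows]  # step 3c
        rows.append(''.join(pick(s, bj) for _, bj in cols))
        center = ''.join(pick(center, ai) for ai, _ in cols)   # step 3d
    return [center] + rows


for row in star_alignment(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Returning the column list from `align`, rather than two gapped strings, makes step 3c a simple re-indexing: every earlier row receives a space exactly in the columns where the center received a new one.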
Multiple Sequence Alignment Approximation Algorithm
Time analysis:
• Choosing S1: execute DP for all sequence pairs, O(k^2 n^2).
• Adding Si to the alignment: execute DP for Si and S’1, O(i·n^2). (In the i-th stage the length of S’1 can be up to i·n.)
Total complexity: $O(k^2 n^2) + \sum_{i=2}^{k} O(i \cdot n^2) = O(k^2 n^2)$

Multiple Sequence Alignment Approximation Algorithm
Approximation ratio:
M*: an optimal alignment
M: the alignment produced by this algorithm
d(i,j): the distance M induces on the pair Si, Sj
For all i: d(1,i) ≤ D(S1, Si) (we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0).

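Multiple Sequence Alignment Approximation Algorithm
Approximation ratio (d*(i,j) denotes the distance M* induces on the pair Si, Sj; both sums below count every pair twice):
Lower bound on the cost of M*, using d*(i,j) ≥ D(Si, Sj) and the definition of S1:
$$\sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j) \;\ge\; \sum_{i=1}^{k} \sum_{j \ne i} D(S_i, S_j) \;\ge\; k \sum_{j=2}^{k} D(S_1, S_j)$$
Upper bound on the cost of M, using the triangle inequality and d(1,j) ≤ D(S1, Sj):
$$\sum_{i=1}^{k} \sum_{j \ne i} d(i,j) \;\le\; \sum_{i=1}^{k} \sum_{j \ne i} \bigl[ d(1,i) + d(1,j) \bigr] \;=\; 2(k-1) \sum_{j=2}^{k} d(1,j) \;\le\; 2(k-1) \sum_{j=2}^{k} D(S_1, S_j)$$
Hence the cost of M is at most 2(k−1)/k = 2 − 2/k times the cost of M*.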