
Multiple Sequence Alignment


1 Multiple Sequence Alignment
S_1 = AGGTC, S_2 = GTTCG, S_3 = TGAAC
Possible alignments (rows are S_1, S_2, S_3):
Alignment 1:
AGGT-C-
-G-TTCG
TG-AAC-
Alignment 2:
AGGT-C-
GTT--CG
-TGAAAC

2 Multiple Sequence Alignment (cont)
Input: sequences S_1, S_2, …, S_k over the same alphabet
Output: gapped sequences S'_1, S'_2, …, S'_k of equal length, such that
1. |S'_1| = |S'_2| = … = |S'_k|
2. Removing the spaces from S'_i yields S_i
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

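3 Multiple Sequence Alignment Example
Consider the following alignment:
AC-CDB-
-C-ADBD
A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1 (a column of two spaces scores 0)
Induced pairwise scores: rows 1 and 2: -3; rows 2 and 3: -5; rows 1 and 3: -4
SP score: -3 - 5 - 4 = -12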
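The SP score above can be checked mechanically. The following is a minimal sketch, assuming the slide's scoring scheme (match 0, mismatch/indel -1) and treating a column of two spaces as scoring 0; the function names `pair_score` and `sp_score` are illustrative, not from the slides.

```python
from itertools import combinations

def pair_score(a, b, match=0, mismatch=-1, indel=-1):
    """Score of the pairwise alignment induced by two gapped rows."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue            # a column of two spaces contributes nothing
        if x == "-" or y == "-":
            score += indel
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

def sp_score(rows):
    """Sum-of-pairs score: sum over all pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print([pair_score(a, b) for a, b in combinations(rows, 2)])  # [-3, -4, -5]
print(sp_score(rows))                                        # -12
```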

4 Multiple Sequence Alignment Complexity
Given k strings of length n, there is a generalization of the pairwise DP algorithm that finds an optimal SP alignment:
- Instead of a 2-dimensional table we have a k-dimensional table
- Each dimension has length n+1
- Each entry depends on 2^k - 1 adjacent entries
Complexity: O(2^k n^k)
This problem is known to be NP-hard (no polynomial-time algorithm is known)
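The k-dimensional DP can be sketched with memoized recursion over index tuples. This is a sketch under stated assumptions, not the slides' own code: it minimizes cost rather than maximizing score, uses unit costs (mismatch 1, indel 1, match 0, two spaces 0), and the name `sp_optimal_cost` is hypothetical. Each table entry looks at the 2^k - 1 non-zero 0/1 vectors that say which sequences contribute a letter to the last column.

```python
from functools import lru_cache
from itertools import combinations, product

def sp_optimal_cost(seqs, gap=1, mismatch=1):
    """Minimum sum-of-pairs cost over all multiple alignments of seqs."""
    k = len(seqs)

    def column_cost(col):
        total = 0
        for x, y in combinations(col, 2):
            if x == y:
                continue                      # match, or two spaces: cost 0
            total += gap if "-" in (x, y) else mismatch
        return total

    # The 2^k - 1 predecessors of an entry: every non-zero 0/1 vector.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):
            return 0                          # all prefixes are empty
        result = float("inf")
        for move in moves:
            if any(step > p for step, p in zip(move, pos)):
                continue                      # would step outside the table
            prev = tuple(p - step for p, step in zip(pos, move))
            col = tuple(seqs[i][pos[i] - 1] if move[i] else "-" for i in range(k))
            result = min(result, best(prev) + column_cost(col))
        return result

    return best(tuple(len(s) for s in seqs))

print(sp_optimal_cost(["kitten", "sitting"]))  # 3 (ordinary edit distance for k = 2)
```

The table has (n+1)^k entries and each one examines up to 2^k - 1 predecessors, which is where the exponential bound comes from; the sketch is only practical for a handful of short strings.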

5 Multiple Sequence Alignment Approximation Algorithm
We use cost instead of score: find an alignment of minimal cost.
Assumption: the cost function δ is a distance function:
- δ(x,x) = 0
- δ(x,y) = δ(y,x) ≥ 0
- δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality; e.g. cost of a mismatch ≤ cost of two indels)
D(S,T): cost of a minimum-cost global alignment between S and T

6 Multiple Sequence Alignment Approximation Algorithm
The 'star' algorithm:
Input: Γ, a set of k strings S_1, …, S_k.
1. Find the string S' in Γ (the center) that minimizes Σ_{S ∈ Γ} D(S', S)
2. Denote S_1 = S' and the rest of the strings as S_2, …, S_k
3. Iteratively add S_2, …, S_k to the alignment as follows:
   a. Suppose S_1, …, S_{i-1} are already aligned as S'_1, …, S'_{i-1}
   b. Align S_i to S'_1 to produce the aligned pair S'_i and S''_1
   c. Adjust S'_2, …, S'_{i-1} by adding spaces wherever spaces were added to S''_1
   d. Replace S'_1 by S''_1
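The steps of the star algorithm can be sketched as follows. This is an illustrative sketch rather than the lecture's implementation: it assumes unit costs (mismatch 1, indel 1, δ(-,-) = 0), and the names `delta`, `align` and `center_star` are hypothetical. `align` accepts a first argument that already contains spaces, so step 3b aligns S_i against the gapped center S'_1 directly.

```python
def delta(x, y, gap=1, mismatch=1):
    """Column cost; delta('-', '-') = 0 because equal symbols cost nothing."""
    if x == y:
        return 0
    return gap if "-" in (x, y) else mismatch

def align(s, t):
    """Optimal global alignment of s (may contain '-') with t.
    Returns (cost, gapped s, gapped t, flags marking spaces newly added to s)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], "-")
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          D[i - 1][j] + delta(s[i - 1], "-"),
                          D[i][j - 1] + delta("-", t[j - 1]))
    cols, i, j = [], n, m
    while i > 0 or j > 0:                      # traceback from the last cell
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            cols.append((s[i - 1], t[j - 1], False)); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], "-"):
            cols.append((s[i - 1], "-", False)); i -= 1
        else:
            cols.append(("-", t[j - 1], True)); j -= 1
    cols.reverse()
    return (D[n][m], "".join(c[0] for c in cols),
            "".join(c[1] for c in cols), [c[2] for c in cols])

def center_star(seqs):
    """Star alignment; returns gapped rows in the input order."""
    k = len(seqs)
    # Step 1: the center minimizes the sum of pairwise alignment costs.
    totals = [sum(align(s, t)[0] for t in seqs) for s in seqs]
    center = totals.index(min(totals))
    order = [center] + [i for i in range(k) if i != center]
    rows = [seqs[center]]                      # rows[0] is the gapped center S'_1
    for idx in order[1:]:
        _, new_center, new_row, is_new = align(rows[0], seqs[idx])   # step 3b
        updated = []
        for row in rows[1:]:                   # step 3c: copy the new spaces
            it = iter(row)
            updated.append("".join("-" if new else next(it) for new in is_new))
        rows = [new_center] + updated + [new_row]                    # step 3d
    result = [None] * k
    for pos, idx in enumerate(order):
        result[idx] = rows[pos]
    return result

for row in center_star(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Ties in the traceback are broken arbitrarily, so the printed rows are one valid star alignment among possibly several.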

7 Multiple Sequence Alignment Approximation Algorithm
Time analysis:
- Choosing S_1: execute DP for all sequence pairs: O(k^2 n^2)
- Adding S_i to the alignment: execute DP for S_i and S'_1: O(i·n^2) (in the i-th stage the length of S'_1 can be up to i·n)
Total complexity: O(k^2 n^2) + Σ_{i=2}^{k} O(i·n^2) = O(k^2 n^2)

8 Multiple Sequence Alignment Approximation Algorithm
Approximation ratio:
- M*: an optimal alignment
- M: the alignment produced by this algorithm
- d(i,j): the distance M induces on the pair S_i, S_j
For all i: d(1,i) = D(S_1,S_i) (we perform an optimal alignment between S'_1 and S_i, and δ(-,-) = 0)

9 Multiple Sequence Alignment Approximation Algorithm
Approximation ratio: the derivation uses two facts, the definition of S_1 (the center minimizes the sum of pairwise distances) and the triangle inequality.
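The two facts named on this slide fit the standard center-star argument, which can be sketched as follows. This is a reconstruction in the notation of the previous slide, not necessarily the slide's exact formulas; v(·) denotes the SP cost of an alignment and d*(i,j) the distance that M* induces on S_i, S_j, both of which are symbols introduced here.

```latex
\begin{align*}
2\,v(M) &= \sum_{i=1}^{k}\sum_{j \ne i} d(i,j)
  \le \sum_{i=1}^{k}\sum_{j \ne i} \bigl[d(i,1) + d(1,j)\bigr]
  && \text{(triangle inequality)} \\
 &= 2(k-1)\sum_{l=2}^{k} d(1,l)
  = 2(k-1)\sum_{l=2}^{k} D(S_1,S_l) \\[4pt]
2\,v(M^*) &= \sum_{i=1}^{k}\sum_{j \ne i} d^*(i,j)
  \ge \sum_{i=1}^{k}\sum_{j \ne i} D(S_i,S_j)
  \ge k\sum_{l=2}^{k} D(S_1,S_l)
  && \text{(definition of } S_1\text{)} \\[4pt]
\frac{v(M)}{v(M^*)} &\le \frac{2(k-1)}{k} < 2
\end{align*}
```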

10 Multiple Sequence Alignment Reminder
S_1 = AGGTC, S_2 = GTTCG, S_3 = TGAAC
Possible alignments (rows are S_1, S_2, S_3):
Alignment 1:
AGGT-C-
-G-TTCG
TG-AAC-
Alignment 2:
AGGT-C-
GTT--CG
-TGAAAC

11 Multiple Sequence Alignment Reminder
Input: sequences S_1, S_2, …, S_k over the same alphabet
Output: gapped sequences S'_1, S'_2, …, S'_k of equal length, such that
1. |S'_1| = |S'_2| = … = |S'_k|
2. Removing the spaces from S'_i yields S_i
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

12 Multiple Sequence Alignment Reminder
The 'star' algorithm:
Input: Γ, a set of k strings S_1, …, S_k.
1. Find the string S_1 (the center) that minimizes Σ_{S ∈ Γ} D(S_1, S)
2. Iteratively add S_2, …, S_k to the alignment
It finds a multiple alignment (MA) costing at most twice the optimal cost.
Problem: a conventional MA does not correctly model evolutionary relationships.

13 Tree Alignment
Input:
- X: a set of sequences
- T: a phylogenetic tree on X (leaves labeled by X)
Output: labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
How do we label internal vertices?
- Sequences
- Profiles (multiple alignments)

14 Profile Alignment
A profile of an MA of length n over an alphabet Σ is a (|Σ|+1) × n table. Column i holds the distribution of Σ (and the gap symbol) at position i.
Example alignment:
AGGT-C-
-G-TTCG
TG-AAC-
Profile (counts per column; divide every entry by 3):
    1 2 3 4 5 6 7
A   1 0 0 1 1 0 0
T   1 0 0 2 1 0 0
G   0 3 1 0 0 0 1
C   0 0 0 0 0 3 0
-   1 0 2 0 1 0 2
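Building the profile is a column-by-column frequency count. The sketch below assumes the three alignment rows read off the columns on this slide and a DNA alphabet; the function name `profile` is illustrative, and exact fractions are used so the entries can be compared with the counts divided by 3.

```python
from collections import Counter
from fractions import Fraction

def profile(rows, alphabet="ATGC"):
    """Return {symbol: [frequency in column 1, column 2, ...]}, gap included."""
    depth = len(rows)
    table = {symbol: [] for symbol in alphabet + "-"}
    for column in zip(*rows):                 # iterate over alignment columns
        counts = Counter(column)
        for symbol in table:
            table[symbol].append(Fraction(counts[symbol], depth))
    return table

rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
for symbol, freqs in profile(rows).items():
    # Multiply back by the number of rows to show the raw counts.
    print(symbol, [int(f * len(rows)) for f in freqs])
```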

15 Profile Alignment
Aligning a sequence to a profile:
- Matching a letter to a position: weighted average of scores
- Indels: introducing new columns gets special consideration
(The same goes for aligning two profiles.)
Profile (counts per column; divide every entry by 3):
    1 2 3 4 5 6 7
A   1 0 0 1 1 0 0
T   1 0 0 2 1 0 0
G   0 3 1 0 0 0 1
C   0 0 0 0 0 3 0
-   1 0 2 0 1 0 2
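The weighted-average rule for matching a letter to a profile position can be illustrated on column 4 of the profile (A with frequency 1/3, T with frequency 2/3). This is a minimal sketch: the scoring function `unit_score` (match 0, anything else -1) and the name `letter_vs_column` are assumptions made for the example, and the special handling of indels mentioned above is not modeled.

```python
from fractions import Fraction

def unit_score(x, y):
    """Hypothetical scoring: match 0, mismatch or indel -1."""
    return 0 if x == y else -1

def letter_vs_column(letter, column, score=unit_score):
    """Weighted average of score(letter, symbol) over the column's distribution."""
    return sum(freq * score(letter, symbol) for symbol, freq in column.items())

column4 = {"A": Fraction(1, 3), "T": Fraction(2, 3)}
print(letter_vs_column("T", column4))  # -1/3
print(letter_vs_column("G", column4))  # -1
```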

16 Clustal Algorithm
Iteratively constructs an MA for the intermediate nodes:
- At each point, holds profiles for all current leaves
- Chooses the closest pair of neighbors
  - neighbors: leaves that have a common father in T
  - distance: cost of an optimal (pairwise) alignment
- Aligns the two profiles to get the 'father profile'
- Replaces the two leaves with their father
Analysis:
- Initialization: O(k^2) alignments
- k-1 iterations
- Iteration i involves k-i-1 new pairwise alignments
ClustalW is a more advanced version in which sequences/profiles are weighted.

17 Lifted Tree Alignments
Lifted tree alignment: each internal node is labeled by one of the labels of its daughters. Internal nodes are sequences and not profiles.
Example (figure): a tree with leaves S_1, S_2, S_3, S_4, S_6, S_5 whose internal nodes carry the lifted labels S_2, S_4, S_4, S_5.
We'll show:
1. A DP algorithm for optimal lifted tree alignment
2. An optimal lifted alignment is a 2-approximation of an optimal tree alignment

18 Lifted Tree Alignments Algorithm
Input:
- X: a set of sequences
- T: a phylogenetic tree on X (leaves labeled by X)
Output: lifted labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
Basic principle: calculate, for every node v in T and every sequence S in X, d(v,S): the optimal cost of v's subtree when v is labeled by S.
The cost of the optimal tree is min_{S ∈ X} d(r,S), where r is the root of T.

19 Lifted Tree Alignments Algorithm
d(v,S): the optimal cost of v's subtree when v is labeled by S
Initialization: for a leaf v labeled S_v: d(v,S_v) = 0
Recurrence: for an internal node v with daughters u_1, …, u_l: d(v,S) = Σ_{i=1}^{l} min_T [ D(S,T) + d(u_i,T) ], where T ranges over the leaf labels in the subtree of u_i
Correctness: check the suboptimal-solution (optimal substructure) property
Complexity:
- O(k^2) pairwise alignments: O(n^2 k^2)
- k-1 iterations
- For an internal node v: O(k_v^2) work
- O(k^2 depth(T)) = O(k^3)
Total: O(k^2 (n^2 + depth(T)))
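The bottom-up computation of d(v,S) can be sketched as below. Several things here are assumptions made for the example rather than details from the slides: the tree is given as nested tuples whose leaves are strings, the pairwise distance D is passed in as a function (a Hamming distance stands in for the alignment cost), and the daughter whose subtree supplies S is forced to keep the label S, so that every labeling considered is a lifted one. The names `lifted_alignment_cost` and `hamming` are hypothetical.

```python
def hamming(s, t):
    """Stand-in for D(S,T) on equal-length strings; it satisfies the triangle inequality."""
    return sum(x != y for x, y in zip(s, t))

def lifted_alignment_cost(tree, dist):
    """Optimal lifted tree alignment cost for a tree of nested tuples."""
    def solve(node):
        # Returns {S: d(node, S)} for every S that can be lifted to this node.
        if isinstance(node, str):
            return {node: 0}                      # leaf: d(v, S_v) = 0
        child_tables = [solve(child) for child in node]
        table = {}
        for idx, own in enumerate(child_tables):
            for s, own_cost in own.items():       # S is lifted from daughter idx
                total = own_cost                  # that daughter keeps S: edge cost 0
                for jdx, other in enumerate(child_tables):
                    if jdx != idx:
                        total += min(dist(s, t) + c for t, c in other.items())
                table[s] = min(total, table.get(s, float("inf")))
        return table

    return min(solve(tree).values())              # best label at the root

tree = (("AA", "AB"), "BB")
print(lifted_alignment_cost(tree, hamming))  # 2
```

In the usage example, labeling both internal nodes with "AB" gives edge costs 1 (to "AA"), 0, 0 and 1 (to "BB"), for a total of 2.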

20 Lifted Tree Alignments Approximation analysis
Claim: an optimal LTA 2-approximates general tree alignment.
We'll show the construction of an LTA that costs at most twice the optimal TA with sequence-labeled nodes (can this be generalized to profile-labeled nodes?).
Notation:
- T*: optimal TA labels
- S_v*: label of node v in T*
- T^L: our constructed LTA
- S_v^L: label of node v in T^L

21 Lifted Tree Alignments Approximation analysis
Construction: we label the nodes bottom-up. For a node v with daughters u_1, …, u_l, we choose the label (from S_{u_1}^L, …, S_{u_l}^L) closest to S_v*.
We need to show: D(T^L) ≤ 2D(T*)

22 Lifted Tree Alignments Approximation analysis
Analysis: some edges in T^L have cost 0. Consider an edge (v,u) of cost > 0:
- S_i: label of the father (v)
- S_j: label of the daughter (u)
- P(v,u): the path in T* from v to the leaf labeled by S_j
D(S_i,S_j) ≤ D(S_i,S_v*) + D(S_j,S_v*)   (triangle inequality)
           ≤ 2D(S_j,S_v*)                (choice of i)
           ≤ 2D(P(v,u))                  (triangle inequality)

23 Lifted Tree Alignments Approximation analysis
D(S_i,S_j) ≤ 2D(P(v,u))
If (v,u) and (v',u') are two different edges with cost > 0 in T^L, then P(v,u) and P(v',u') are edge-disjoint. Summing the bound over all such edges therefore gives D(T^L) ≤ 2D(T*).
Final remarks:
- The lifted tree alignment T^L is only conceptual (we don't have T*)
- An optimal LTA cannot cost more than T^L
- In the case of profile-labeled nodes, the construction and analysis still hold as long as the cost is a distance function
Q.E.D.

