. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau
2 Multiple Sequence Alignment Reminder S 1 = AGGTC S 2 = GTTCG S 3 = TGAAC Possible alignment A-TA-T GGGGGG G--G-- TTATTA -TA-TA CCCCCC -G--G- AG-AG- GTTGTT GTGGTG T-AT-A --A--A CCACCA -GC-GC
3 Input: Sequences S 1, S 2,…, S k over the same alphabet Output: Gapped sequences S’ 1, S’ 2,…, S’ k of equal length 1.|S’ 1 |= |S’ 2 |=…= |S’ k | 2.Removal of spaces from S’ i obtains S i Sum-of-pairs (SP) score for a multiple global alignment is the sum of scores of all pairwise alignments induced by it. Multiple Sequence Alignment Reminder
4 The ‘star’ algorithm: Input: Γ - set of k strings S 1, …,S k. 1.Find the string S 1 (center) that minimizes 2.Iteratively add S 2, …,S k to the alignment Finds MA costing at most twice the optimal cost! Multiple Sequence Alignment Reminder Problem: Conventional MA does not model correctly evolutionary relationships
5 Input: X - set of sequences T – phylogenetic tree on X (leaves labeled by X ) Output: labels on internal vertices of T, s.t. sum of costs of all edges of T is minimal. How do we label internal vertices? Sequences Profiles (multiple alignments) Tree Alignment
6 A profile of a MA of length n over alphabet Σ is a (| Σ |+1)*n table. Column i holds the distribution of Σ (and gap) in that position Profile Alignment A-TA-T GGGGGG G--G-- TTATTA -TA-TA CCCCCC -G--G- A T G C : 3
7 Aligning a sequence to a profile: Matching letter to position: weighted average of scores Indels: introducing new columns gets special consideration (same goes for aligning two profiles) Profile Alignment A T G C : 3
8 Iteratively constructs MA for intermediate nodes At each point holds profiles for all leaves Chooses closest pair of neighbors - neighbors – have common father in T - distance - cost of optimal (pairwise) alignment Aligns the two profiles to get the ‘father-profile’ Replaces the two leaves with their father Analysis: Initialization – O(k 2 ) alignments k-1 iterations Iteration i involves k-i-1 new pairwise alignments Clustal Algorithm ClustalW – more advanced version. Sequences/profiles are weighted
9 Lifted Tree Alignments Lifted tree alignment – each internal node is labeled by one of the labels of its daughters Internal nodes are sequences and not profiles Example: S1S1 S2S2 S3S3 S4S4 S6S6 S5S5 S2S2 S4S4 S4S4 S5S5 We’ll show: 1. DP algorithm for optimal lifted tree alignment 2. Optimal lifted alignment is 2-approximation of optimal tree alignment
10 Lifted Tree Alignments Algorithm Input: X - set of sequences T – phylogenetic tree on X (leaves labeled by X ) Output: lifted labels on internal vertices of T, s.t. sum of costs of all edges of T is minimal. Basic principle: calculate for every node v in T, and sequence S in X : d(v,S) - the optimal cost of v ’s subtree when it is labeled by S The cost of optimal tree is S1S1 S2S2 S3S3 S4S4 S6S6 S5S5 S2S2 S4S4 S4S4 S5S5
11 Lifted Tree Alignments Algorithm d(v,S) - the optimal cost of v ’s subtree when it is labeled by S Initialization: for leaf v labeled S v - Recurrence: for internal node v with daughters u 1,…u l - Correctness: check for suboptimal solution property Complexity: O(k 2 ) pairwise alignments - O(n 2 k 2 ). k-1 iterations For internal node v - O(k v 2 ) work Total: O(k 2 (n 2 +depth(T))) S1S1 S2S2 S3S3 S4S4 S6S6 S5S5 S2S2 S4S4 S4S4 S5S5 O(k 2 depth(T))=O(k 3 )
12 Lifted Tree Alignments Approximation analysis Claim: Optimal LTA 2-approximates general tree alignments We’ll show construction of LTA which costs at most twice the optimal TA with sequence-labeled nodes (? can be generalized for profile-labeled nodes ?) Notations: T* - optimal TA labels S v * - label of node v in T* T L – our constructed LTA S v L - label of node v in T L S1S1 S2S2 S3S3 S4S4 S6S6 S5S5 S2S2 S4S4 S4S4 S5S5
13 Lifted Tree Alignments Approximation analysis Construction: We label the nodes bottom-up. For node v with daughters u 1,…u l – we choose the label (from S u1 L,…,S u l L ) closest to S v * We need to show: D(T L ) ≤ 2D(T*) S1S1 S2S2 S3S3 S4S4 S6S6 S5S5 S2S2 S4S4 S4S4 S5S5
14 Lifted Tree Alignments Approximation analysis Analysis: Some edges in T L have cost 0 Observe edges (v,u) of cost > 0: S i - label of father( v ) S j - label of daughter ( u ) P(v,u) – the path in T* from v to the leaf labeled by S j D(S i,S j ) ≤ D(S i,S v *) + D(S j,S v *) ≤ 2D(S j,S v *) ≤ 2D(P(v,u)) S1S1 S2S2 S3S3 S4S4 S6S6 S5S5 S2S2 S4S4 S4S4 S5S5 triangle inequality choice of i triangle inequality
15 Lifted Tree Alignments Approximation analysis D(S i,S j ) ≤ 2D(P(v,u)) S1S1 S2S2 S3S3 S4S4 S6S6 S5S5 S2S2 S4S4 S4S4 S5S5 If (u,v) and (u’,v’) are two different edges with cost > 0 in T L, then P(u,v) and P(u’,v’) are mutually disjoint in edges Final Remarks: Lifted tree alignment T L is only conceptual (we don’t have T* ) Optimal LTA cannot cost more than T L In case of profile-labeled nodes: construction and analysis OK when cost is still distance function Q.E.D.