Multiple Sequence Alignment

Multiple Sequence Alignment
S1 = AGGTC
S2 = GTTCG
S3 = TGAAC
[Figure: two possible alignments of S1, S2 and S3]

Multiple Sequence Alignment (cont)
Input: sequences S1, S2, …, Sk over the same alphabet
Output: gapped sequences S’1, S’2, …, S’k of equal length
  |S’1| = |S’2| = … = |S’k|
  Removing the spaces from S’i yields Si
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments it induces.

Multiple Sequence Alignment Example
Consider the following alignment:
  AC-CDB-
  -C-ADBD
  A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1
SP score: -3 (rows 1,2) - 5 (rows 2,3) - 4 (rows 1,3) = -12
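The SP score of the example can be checked with a few lines of Python. This is a minimal sketch; the function names `pair_score` and `sp_score` are illustrative, and a column holding two spaces is assumed to score 0, in line with δ(-,-) = 0 used later in the slides.

```python
from itertools import combinations

def pair_score(row_a, row_b, match=0, mismatch=-1, indel=-1):
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue                 # assumed: a column of two spaces scores 0
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def sp_score(rows, **scheme):
    """Sum-of-pairs score: sum of the induced scores over all pairs of rows."""
    return sum(pair_score(a, b, **scheme) for a, b in combinations(rows, 2))

alignment = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(alignment))  # -12
```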

Multiple Sequence Alignment Complexity
Given k strings of length n, a generalization of the DP algorithm finds an optimal SP alignment:
  Instead of a 2-dimensional table we have a k-dimensional table
  Each dimension has length n+1
  Each entry depends on 2^k - 1 adjacent entries
  Complexity: O(2^k n^k)
The problem is known to be NP-hard (no polynomial-time algorithm is known)
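The k-dimensional table can be sketched with a memoized recursion. The code below is an illustration under assumptions, not the slides' own implementation: it minimizes cost with unit costs (mismatch = indel = 1, match and space-space = 0), which for the example scoring scheme above is just the negated SP score, and the name `optimal_sp_cost` is made up for this sketch. It is only practical for very small inputs.

```python
from functools import lru_cache
from itertools import combinations, product

def column_cost(column):
    """SP cost of one column: 1 per mismatch or indel, 0 for a match or two spaces."""
    return sum(1 for x, y in combinations(column, 2) if x != y)

def optimal_sp_cost(seqs):
    """Minimum SP cost over all multiple alignments of seqs (exponential in k)."""
    k = len(seqs)
    # Each non-zero 0/1 vector says which sequences put a letter in the last
    # column; there are 2^k - 1 such vectors, one per adjacent table entry.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):
            return 0                          # all prefixes are empty
        options = []
        for move in moves:
            if any(p < m for p, m in zip(pos, move)):
                continue                      # would step outside the table
            prev = tuple(p - m for p, m in zip(pos, move))
            column = [s[p - 1] if m else "-" for s, p, m in zip(seqs, pos, move)]
            options.append(best(prev) + column_cost(column))
        return min(options)

    return best(tuple(len(s) for s in seqs))
```

For two sequences this reduces to the usual edit-distance table; every additional sequence adds one dimension and doubles the number of neighbouring entries consulted per cell.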

Multiple Sequence Alignment Approximation Algorithm
We use cost instead of score: find an alignment of minimal cost
Assumption: the cost function δ is a distance function
  δ(x,x) = 0
  δ(x,y) = δ(y,x) ≥ 0
  δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality; e.g. cost of a mismatch ≤ cost of two indels)
D(S,T): cost of a minimum-cost global alignment of S and T
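As an illustration of D(S,T), the following sketch computes the minimum global alignment cost with the standard two-row dynamic program. The unit costs (mismatch = indel = 1) are an assumption chosen so that δ is a distance function; the name `alignment_cost` is hypothetical.

```python
def alignment_cost(s, t, mismatch=1, indel=1):
    """D(s, t): minimum cost of a global alignment of s and t."""
    prev = [j * indel for j in range(len(t) + 1)]      # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]
```

With these costs D is the ordinary edit distance, so it is symmetric and satisfies the triangle inequality, which is what the approximation analysis relies on.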

The ‘star’ algorithm:
Input: Γ, a set of k strings S1, …, Sk
Find the string S’ ∈ Γ (the center) that minimizes Σ_{S ∈ Γ} D(S’, S)
Denote S1 = S’ and the rest of the strings S2, …, Sk
Iteratively add S2, …, Sk to the alignment as follows:
  Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1
  Align Si to S’1 to produce S’i and S’’1, aligned to each other
  Adjust S’2, …, S’i-1 by adding spaces wherever spaces were added to S’’1
  Replace S’1 by S’’1
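The steps of the star algorithm can be sketched as follows. This is an illustrative implementation under assumptions: unit costs with δ(-,-) = 0, hypothetical helper names `delta`, `align` and `star_alignment`, and arbitrary tie-breaking when several optimal pairwise alignments exist.

```python
def delta(x, y, mismatch=1, indel=1):
    """Column cost; equal symbols, including two spaces, cost 0."""
    if x == y:
        return 0
    return indel if "-" in (x, y) else mismatch

def align(s, t):
    """Optimal global alignment under delta; s may already contain spaces.
    Returns (cost, gapped_s, gapped_t)."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + delta(s[i - 1], "-")
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + delta("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                           dp[i - 1][j] + delta(s[i - 1], "-"),
                           dp[i][j - 1] + delta("-", t[j - 1]))
    a, b, i, j = [], [], n, m                  # trace back from the last cell
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + delta(s[i - 1], "-"):
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return dp[n][m], "".join(reversed(a)), "".join(reversed(b))

def star_alignment(seqs):
    """Center-star multiple alignment; returns the gapped rows, center first."""
    totals = [sum(align(s, t)[0] for t in seqs) for s in seqs]
    c = totals.index(min(totals))              # index of the center string
    center, rows = seqs[c], []
    for idx, s in enumerate(seqs):
        if idx == c:
            continue
        _, new_center, new_row = align(center, s)
        # Mark which columns of new_center already existed in the old center.
        existed, p = [], 0
        for ch in new_center:
            if p < len(center) and ch == center[p]:
                existed.append(True)
                p += 1
            else:
                existed.append(False)
        # Copy the newly added spaces into the rows aligned earlier.
        padded = []
        for row in rows:
            chars = iter(row)
            padded.append("".join(next(chars) if old else "-" for old in existed))
        rows = padded + [new_row]
        center = new_center
    return [center] + rows

for row in star_alignment(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Because spaces are only ever added, never removed, the pairwise alignment between the center and each earlier string keeps its optimal cost after later strings are merged in.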

Time analysis:
  Choosing S1: run DP for all sequence pairs, O(k^2 n^2)
  Adding Si to the alignment: run DP for Si and S’1, O(i·n^2) (in the i-th stage the length of S’1 can be up to i·n)
  Total complexity: O(k^2 n^2) + Σ_{i=2..k} O(i·n^2) = O(k^2 n^2)

Approximation ratio:
  M*: an optimal alignment
  M: the alignment produced by this algorithm
  d(i,j): the distance M induces on the pair Si, Sj
  For all i: d(1,i) = D(S1,Si) (we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)

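Approximation ratio (D(M) is the SP cost of M, and d*(i,j) is the distance M* induces on the pair Si, Sj):
  Triangle inequality:
    2·D(M) = Σ_{i≠j} d(i,j) ≤ Σ_{i≠j} [d(i,1) + d(1,j)] = 2(k-1) Σ_{j=2..k} D(S1,Sj)
  Definition of S1:
    2·D(M*) = Σ_{i≠j} d*(i,j) ≥ Σ_{i≠j} D(Si,Sj) = Σ_i Σ_{j≠i} D(Si,Sj) ≥ k Σ_{j=2..k} D(S1,Sj)
  Hence D(M) / D(M*) ≤ 2(k-1)/k < 2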

Multiple Sequence Alignment Reminder
The ‘star’ algorithm:
  Input: Γ, a set of k strings S1, …, Sk
  Find the string S1 (the center) that minimizes Σ_{S ∈ Γ} D(S1, S)
  Iteratively add S2, …, Sk to the alignment
  Finds an MA costing at most twice the optimal cost!
Problem: conventional MA does not correctly model evolutionary relationships

Tree Alignment
Input:
  X: a set of sequences
  T: a phylogenetic tree on X (leaves labeled by X)
Output: labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal
How do we label internal vertices?
  Sequences
  Profiles (multiple alignments)

Profile Alignment
A profile of an MA of length n over alphabet Σ is a (|Σ|+1) × n table.
Column i holds the distribution of the letters of Σ (and the gap) at position i.
[Figure: an example multiple alignment and its profile table over A, T, G, C and -]
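A profile can be built directly from the rows of an alignment. The sketch below is illustrative: the three-row alignment is made up for the example (it is not the one from the slide's figure), and the profile is stored as a dictionary from symbol to a list of per-column frequencies rather than as a literal table.

```python
from collections import Counter

def build_profile(rows, alphabet="ATGC"):
    """Return {symbol: [frequency in column 1, column 2, ...]} for Σ plus '-'."""
    symbols = list(alphabet) + ["-"]
    k = len(rows)
    profile = {sym: [] for sym in symbols}
    for column in zip(*rows):                # iterate over alignment columns
        counts = Counter(column)
        for sym in symbols:
            profile[sym].append(counts[sym] / k)
    return profile

rows = ["A-TGC", "AG-TC", "AGTTC"]           # hypothetical alignment
profile = build_profile(rows)
for sym, freqs in profile.items():
    print(sym, [round(f, 2) for f in freqs])
```

Each column of the result sums to 1, so it can be read as the distribution of letters and gaps at that position.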

Profile Alignment
Aligning a sequence to a profile:
  Matching a letter to a position: weighted average of scores
  Indels: introducing new columns gets special consideration
(The same goes for aligning two profiles)
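The weighted average for matching a letter to a profile position might look like the sketch below. It works with costs rather than scores, assumes unit mismatch and indel costs by default, and does not model the special treatment of new columns mentioned above; the function name is hypothetical.

```python
def letter_vs_column_cost(letter, column, mismatch=1, indel=1):
    """Expected cost of aligning `letter` with a profile column.

    `column` maps each symbol (letters and '-') to its frequency at that position.
    """
    cost = 0.0
    for sym, freq in column.items():
        if sym == letter:
            continue                          # identical symbols cost 0
        pair_cost = indel if "-" in (sym, letter) else mismatch
        cost += freq * pair_cost              # weight by how often sym occurs
    return cost

column = {"A": 0.5, "G": 0.25, "-": 0.25}     # hypothetical profile column
print(letter_vs_column_cost("A", column))     # 0.5
```

Passing `"-"` as the letter gives the expected cost of placing a space opposite the column.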

Clustal Algorithm
Iteratively constructs an MA for intermediate nodes
At each point holds profiles for all leaves
Chooses the closest pair of neighbors
  neighbors: leaves with a common father in T
  distance: cost of an optimal (pairwise) alignment
Aligns the two profiles to get the ‘father profile’
Replaces the two leaves with their father
Analysis:
  Initialization: O(k^2) alignments
  k-1 iterations
  Iteration i involves k-i-1 new pairwise alignments
ClustalW: a more advanced version in which sequences/profiles are weighted

Lifted Tree Alignments
Each internal node is labeled by one of the labels of its daughters
Internal nodes are sequences and not profiles
Example: [Figure: a lifted tree over the sequences S1, …, S6]
We’ll show:
  A DP algorithm for optimal lifted tree alignment
  An optimal lifted alignment is a 2-approximation of an optimal tree alignment

Lifted Tree Alignments Algorithm
Input:
  X: a set of sequences
  T: a phylogenetic tree on X (leaves labeled by X)
Output: lifted labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal
Basic principle: calculate, for every node v in T and every sequence S in X,
  d(v,S): the optimal cost of v’s subtree when v is labeled by S
The cost of an optimal tree is min_{S ∈ X} d(r,S), where r is the root of T

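Lifted Tree Alignments Algorithm
d(v,S): the optimal cost of v’s subtree when v is labeled by S
Initialization: for a leaf v labeled Sv, d(v,Sv) = 0 and d(v,S) = ∞ for S ≠ Sv
Recurrence: for an internal node v with daughters u1, …, ul, d(v,S) = ∞ unless S labels a leaf below v; otherwise the daughter uj whose subtree contains S is also labeled S, and
  d(v,S) = d(uj,S) + Σ_{i≠j} min_T [ d(ui,T) + D(S,T) ]
Correctness: check the optimal-substructure property
Complexity:
  O(k^2) pairwise alignments: O(n^2 k^2)
  k-1 iterations
  For internal node v: O(kv^2) work (kv = number of leaves below v)
  Total: O(k^2 (n^2 + depth(T)))
  O(k^2 depth(T)) = O(k^3)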
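The dynamic program over d(v,S) can be sketched as below. Several things here are assumptions made for illustration: the tree is given as nested tuples with sequences (strings) at the leaves, D is the unit-cost edit distance, the lifting constraint is enforced by always passing the label S up from the daughter whose subtree contains it, and the names `edit_distance` and `lifted_alignment_cost` are hypothetical.

```python
def edit_distance(s, t):
    """D(s, t) with unit mismatch and indel costs (an assumed cost function)."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        curr = [i]
        for j in range(1, len(t) + 1):
            curr.append(min(prev[j - 1] + (s[i - 1] != t[j - 1]),
                            prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]

def lifted_alignment_cost(tree, dist=edit_distance):
    """Return (optimal lifted cost, label chosen for the root)."""
    def solve(node):
        # Returns {S: d(node, S)} for every S that may label `node`.
        if isinstance(node, str):
            return {node: 0}                  # a leaf keeps its own sequence
        tables = [solve(child) for child in node]
        d = {}
        for j, own in enumerate(tables):
            for s, cost in own.items():       # lift s from daughter j
                for i, other in enumerate(tables):
                    if i != j:
                        cost += min(c + dist(s, t) for t, c in other.items())
                d[s] = min(cost, d.get(s, float("inf")))
        return d

    table = solve(tree)
    root_label = min(table, key=table.get)
    return table[root_label], root_label

tree = (("AC", "AG"), ("AC", "TT"))           # hypothetical 4-leaf tree
print(lifted_alignment_cost(tree))            # (3, 'AC')
```

In the example, labeling all three internal nodes with AC leaves only the edges to AG (cost 1) and TT (cost 2) with non-zero cost.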

Lifted Tree Alignments Approximation analysis
Claim: an optimal LTA 2-approximates general tree alignments
We’ll show the construction of an LTA which costs at most twice the optimal TA with sequence-labeled nodes
(? can be generalized for profile-labeled nodes ?)
Notations:
  T*: optimal TA labels
  Sv*: label of node v in T*
  TL: our constructed LTA
  SvL: label of node v in TL

Construction: we label the nodes bottom-up.
For a node v with daughters u1, …, ul, we choose the label (from Su1L, …, SulL) closest to Sv*
We need to show: D(TL) ≤ 2D(T*)

Some edges in TL have cost 0
Consider the edges (v,u) of cost > 0:
  Si: label of the father (v)
  Sj: label of the daughter (u)
  P(v,u): the path in T* from v to the leaf labeled by Sj
D(Si,Sj) ≤ D(Si,Sv*) + D(Sj,Sv*)   (triangle inequality)
         ≤ 2D(Sj,Sv*)              (choice of i)
         ≤ 2D(P(v,u))              (triangle inequality)

D(Si,Sj) ≤ 2D(P(v,u))
If (v,u) and (v’,u’) are two different edges of cost > 0 in TL, then P(v,u) and P(v’,u’) are edge-disjoint
Summing over all edges of TL therefore gives D(TL) ≤ 2D(T*). Q.E.D.
Final remarks:
  The lifted tree alignment TL is only conceptual (we don’t have T*)
  An optimal LTA cannot cost more than TL
  In the case of profile-labeled nodes, the construction and analysis still hold as long as the cost is a distance function
