1
Aligning Alignments Exactly. By John Kececioglu and Dean Starrett, CS Dept., Univ. of Arizona. Appeared in the 8th ACM RECOMB, 2004. Presented by Jie Meng.
2
Background; Definition; Hardness; An exponential-time algorithm
3
Alignments
Given two (DNA or protein) sequences, an alignment places them against each other so that similar parts line up as closely as possible, for example:
AT-C-TCGCT
-TG-ATG-AT
There are four kinds of aligned columns: match, mismatch, insertion, and deletion.
4
Scoring Alignments
There are four types of aligned columns:
- Match: score_match = 0.
- Mismatch: score_mismatch ≥ 0.
- Insertion: score_insertion ≥ 0.
- Deletion: score_deletion ≥ 0.
The score of an alignment is the sum of the scores of its aligned columns. The goal is to minimize the score.
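A minimal sketch of this column-by-column scoring, assuming unit costs for mismatch, insertion, and deletion (the slide fixes only the match score at 0) and assuming that a space in the top row counts as an insertion. The function names are illustrative and do not come from the paper.

```python
GAP = "-"

def column_score(x, y, mismatch=1, insertion=1, deletion=1):
    """Score one aligned column (x over y). The cost values are assumptions."""
    if x == GAP and y == GAP:
        raise ValueError("a pairwise alignment has no all-space column")
    if x == GAP:
        return insertion   # assumed convention: space on top = insertion
    if y == GAP:
        return deletion    # assumed convention: space on bottom = deletion
    return 0 if x == y else mismatch

def alignment_score(top, bottom, **costs):
    """Sum of column scores; lower is better."""
    if len(top) != len(bottom):
        raise ValueError("rows of an alignment must have equal length")
    return sum(column_score(x, y, **costs) for x, y in zip(top, bottom))

# The two-row alignment from the previous slide.
print(alignment_score("AT-C-TCGCT", "-TG-ATG-AT"))
```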
5
Gap-cost
We can refine the indel score into a gap-open cost and a gap-extension cost. A gap of size x then costs open + x * extension instead of x * indel.
AT----CGCTTCAT
-TGCAT--AT-----
Here the gap of size 4 in the first row costs open + 4 * extension.
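The two gap models can be compared directly. This sketch assumes a gap is a maximal run of spaces in one row; the helper names and the cost values in the example are illustrative.

```python
from itertools import groupby

def linear_gap_cost(x, indel):
    """Cost of a gap of size x when every space is charged separately."""
    return x * indel

def affine_gap_cost(x, gap_open, gap_extension):
    """Cost of a gap of size x under the open + x * extension model."""
    return gap_open + x * gap_extension if x > 0 else 0

def gap_lengths(row, gap="-"):
    """Lengths of the maximal runs of spaces in one alignment row."""
    return [len(list(run)) for ch, run in groupby(row) if ch == gap]

first_row = "AT----CGCTTCAT"
for x in gap_lengths(first_row):
    # Example costs only: open = 2, extension = 1, indel = 1.
    print(x, linear_gap_cost(x, 1), affine_gap_cost(x, 2, 1))
```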
6
Multiple Alignments
In general we also need to compare several sequences and find their similarities. Multiple alignment generalizes the alignment idea to handle many sequences.
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT
7
Sum-of-Pairs (SP) Score
Given a multiple alignment, the sum-of-pairs (SP) score is the sum of the scores of the induced pairwise alignments, one for each pair of rows. For the alignment
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT
the SP score is the sum of three pairwise scores:
AT-C-TCGAT     AT-C-TCGAT     -TGCAT--AT
-TGCAT--AT  +  ATCCA-CGCT  +  ATCCA-CGCT
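A small sketch of the SP score, assuming unit mismatch and indel costs and no gap-open cost. Columns in which both rows of a pair hold a space are dropped before scoring, which is what "induced" means here. All names are illustrative.

```python
from itertools import combinations

GAP = "-"

def induced_pair(r, s):
    """Project two rows of a multiple alignment, dropping space/space columns."""
    kept = [(x, y) for x, y in zip(r, s) if not (x == GAP and y == GAP)]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)

def pair_score(r, s, mismatch=1, indel=1):
    """Score of the induced pairwise alignment (assumed unit costs)."""
    a, b = induced_pair(r, s)
    total = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            total += indel
        elif x != y:
            total += mismatch
    return total

def sp_score(rows, mismatch=1, indel=1):
    """Sum of induced pairwise scores over all pairs of rows."""
    return sum(pair_score(r, s, mismatch, indel)
               for r, s in combinations(rows, 2))

rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(sp_score(rows))
```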
8
BAD NEWS
Multiple alignment is NP-hard.
One method is to approximate the optimal value: progressive alignment.
A problem arises naturally: aligning alignments.
9
Aligning Alignments
Let S be a collection of strings s_1, s_2, s_3, …, s_k over an alphabet Σ. An alignment of S is a matrix A with k rows such that:
i) each entry is either a letter or a space;
ii) no column consists entirely of spaces;
iii) reading across row i and removing the spaces gives string s_i.
As before, we have three types of column score: match, mismatch, and insertion/deletion.
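The three conditions can be checked mechanically. The sketch below represents the matrix as a list of equal-length strings and uses "-" for a space; both choices are assumptions made for illustration.

```python
def is_alignment_of(strings, matrix, gap="-"):
    """Check conditions (i)-(iii) for a matrix given as a list of row strings."""
    if len(matrix) != len(strings):
        return False
    width = len(matrix[0]) if matrix else 0
    if any(len(row) != width for row in matrix):
        return False  # not a rectangular matrix
    # (ii) no column may consist entirely of spaces
    for c in range(width):
        if all(row[c] == gap for row in matrix):
            return False
    # (i) and (iii): removing spaces from row i must give back string s_i
    return all(row.replace(gap, "") == s for row, s in zip(matrix, strings))

S = ["ATCTCGAT", "TGCATAT", "ATCCACGCT"]
A = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(is_alignment_of(S, A))
```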
10
Aligning Alignments
Given two alignments, A with k sequences of length N and B with l sequences of length M, we want to align the columns of A and B.
A:
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGAT
B:
CT-ATTGGAT
-TTAT-G--T
CTTA-GGGAT
11
Aligning Alignments
In other words, we treat the columns of A and B as single letters, just as when aligning two sequences. For example, aligning
CT     AT
GT     -T
-T     GT
can give
C-T
G-T
--T
-AT
--T
-GT
12
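Aligning Alignments
The score function is still sum-of-pairs: the sum, over all pairs of rows of the combined alignment, of the induced pairwise alignment scores.
Note that the induced alignment of rows A_i' and B_j' may contain columns with a space in both sequences, so we simply remove those columns.
A_i': a----aa-a
B_j': aaa-a-a-a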
13
Aligning Alignments
Without gap costs (every space is charged separately, with no gap-open cost), aligning alignments is solvable in polynomial time. We can apply dynamic programming as in aligning two sequences; the only difference is that we align columns.
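A sketch of that dynamic program, under stated assumptions: unit mismatch and indel costs, no gap-open cost, and only the pairs made of one row of A and one row of B are counted. The pairs inside A or inside B are unaffected by inserting all-space columns in this model, so they add the same constant to every solution. Function names are illustrative, not taken from the paper.

```python
GAP = "-"

def pair_cost(x, y, mismatch=1, indel=1):
    """Cost of one pair of entries; space/space columns are removed, so cost 0."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return indel
    return 0 if x == y else mismatch

def column_cost(col_a, col_b, mismatch=1, indel=1):
    """Sum-of-pairs cost between a column of A and a column of B."""
    return sum(pair_cost(x, y, mismatch, indel) for x in col_a for y in col_b)

def columns(rows):
    """Turn a list of row strings into a list of column strings."""
    return ["".join(col) for col in zip(*rows)]

def align_columns(A, B, mismatch=1, indel=1):
    """Minimum cross-pair SP cost of aligning the columns of A and B."""
    cols_a, cols_b = columns(A), columns(B)
    gap_a, gap_b = GAP * len(A), GAP * len(B)  # all-space columns
    n, m = len(cols_a), len(cols_b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + column_cost(cols_a[i - 1], gap_b, mismatch, indel)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + column_cost(gap_a, cols_b[j - 1], mismatch, indel)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                # column of A against column of B
                D[i - 1][j - 1] + column_cost(cols_a[i - 1], cols_b[j - 1], mismatch, indel),
                # column of A against an all-space column in B
                D[i - 1][j] + column_cost(cols_a[i - 1], gap_b, mismatch, indel),
                # all-space column in A against a column of B
                D[i][j - 1] + column_cost(gap_a, cols_b[j - 1], mismatch, indel),
            )
    return D[n][m]

print(align_columns(["CT", "GT", "-T"], ["AT", "-T", "GT"]))
```

The table has (N+1)(M+1) cells and each cell evaluates three column costs of k*l pair comparisons, so the sketch runs in O(N*M*k*l) time.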
14
Aligning Alignments
With gap costs, this problem is NP-complete. We can use a reduction from the MAX-CUT problem.
MAX-CUT: Given a graph G = (V, E) and an integer c, ask whether there is a partition of V into L and R (V = L ∪ R, L ∩ R = ∅) such that the size of the cut is no less than c.
The cut is the set of edges that have one end vertex in L and the other in R.
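To make the MAX-CUT definition concrete, here is a brute-force check that tries every partition. It is exponential in the number of vertices and is meant only to illustrate the definition, not the reduction; the function names are illustrative.

```python
from itertools import product

def cut_size(edges, left):
    """Number of edges with exactly one end vertex in `left`."""
    return sum(1 for u, v in edges if (u in left) != (v in left))

def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by enumerating all 2^|V| partitions (brute force)."""
    for mask in product([False, True], repeat=len(vertices)):
        left = {v for v, inside in zip(vertices, mask) if inside}
        if cut_size(edges, left) >= c:
            return True
    return False

# A triangle: any partition cuts at most two of its three edges.
triangle = [(1, 2), (2, 3), (1, 3)]
print(has_cut_of_size([1, 2, 3], triangle, 2))
print(has_cut_of_size([1, 2, 3], triangle, 3))
```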
15
NP-hardness
Given an instance of MAX-CUT, a graph G = (V, E) with V = {v_1, v_2, …, v_n} and E = {e_1, e_2, …, e_m}, and an integer c, we construct two multiple alignments A and B over the alphabet {0, 1}:
- both A and B have m edge rows and k dummy rows, each edge row corresponding to an edge;
- A has 2n columns, and each consecutive pair of columns corresponds to a vertex;
- B has 3n columns, and each consecutive triple of columns corresponds to a vertex.
16
NP-hardness
The dummy rows in A are (0-)^n; the dummy rows in B are (0--)^n.
Edge rows in A: for the row of edge e = (v_i, v_j), the columns for vertices i and j each hold the substring "-1", with spaces elsewhere.
Edge rows in B: for the row of edge e = (v_i, v_j) with i < j, the columns for vertex i hold the substring "010" and the columns for vertex j hold the substring "-10".
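The rows described above can be generated as follows. This is only a sketch: the slide does not say what the edge rows of B contain outside the blocks for v_i and v_j, so spaces are assumed there, and the number of dummy rows k is left as a parameter because the slide does not give its value. The function name is hypothetical.

```python
def build_instance(n, edges, k):
    """Build alignments A and B (lists of row strings) from a graph.

    n     -- number of vertices, numbered 1..n
    edges -- list of (i, j) vertex pairs
    k     -- number of dummy rows (value not given on the slide)
    """
    dummy_a = ["0-" * n] * k     # (0-)^n
    dummy_b = ["0--" * n] * k    # (0--)^n
    edge_a, edge_b = [], []
    for u, v in edges:
        i, j = min(u, v), max(u, v)
        row_a = ["--"] * n       # spaces elsewhere, as stated for A
        row_a[i - 1] = row_a[j - 1] = "-1"
        row_b = ["---"] * n      # ASSUMPTION: spaces elsewhere in B as well
        row_b[i - 1] = "010"
        row_b[j - 1] = "-10"
        edge_a.append("".join(row_a))
        edge_b.append("".join(row_b))
    return edge_a + dummy_a, edge_b + dummy_b

A, B = build_instance(3, [(1, 2), (2, 3)], k=2)
print("\n".join(A))
print("\n".join(B))
```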
17
NP-hardness
We simply set the match score to 0, the mismatch score to 1, the gap-open cost to 2, and the gap-extension cost to 1, and ask whether there is an alignment whose score is less than d - c.
This gives an instance of Aligning Alignments.
18
HOMEWORK4
Given a set of multiple alignments {A_1, A_2, …, A_n}, where each A_i is a multiple alignment with k_i sequences, and no gap costs: is the problem of multiple alignment on the alignments {A_1, A_2, …, A_n} hard or easy? Use the method in this paper to align multiple alignments, i.e. align columns. If it is hard, prove it; otherwise, give an efficient algorithm and prove its complexity and correctness.
19
Exact Algorithm
The basic idea is still dynamic programming. We have to remember extra information in a set called the shape, S: for each row of a multiple alignment, we record the column of its right-most letter.
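A minimal sketch of a shape as the slide describes it: for each row, the column of its right-most letter. Columns are numbered from 1, and 0 is used for a row that has no letter yet; both conventions are assumptions for illustration.

```python
def shape(rows, gap="-"):
    """For each row, the 1-based column of its right-most letter (0 if none)."""
    return tuple(
        max((c for c, ch in enumerate(row, start=1) if ch != gap), default=0)
        for row in rows
    )

# A prefix of an alignment, five columns wide.
print(shape(["AT-C-", "-TGCA", "-----"]))
```

Two alignment prefixes with the same shape have spaces open in the same rows, which is the information needed to tell whether appending a new column starts a new gap or extends an existing one.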
20
Exact Algorithm S(i, j)=
21
Exact Algorithm
C(i,j,t)=min
where g(A[i], B[j], s) is the total number of gaps initiated by appending columns A[i] and B[j] onto an alignment that ends in shape s.
22
Exact Algorithm
The optimum value is
The problem here is that the number of shapes may be very large, so in the worst case the time and space complexity is exponential.
23
Any Questions? 423B jmeng@cs.tamu.edu