Download presentation
1
Multiple Sequence Alignment
CS 5263 & CS 4593 Bioinformatics Multiple Sequence Alignment
2
Multiple Sequence Alignment
Motivation: A faint similarity between two sequences becomes very significant if present in many sequences Definition Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that All sequences have the same length L Score of the alignment is maximum Two issues How to score an alignment? How to find a (nearly) optimal alignment?
3
Scoring function - first assumption
Columns are independent Similar in pair-wise alignment Therefore, the score of an alignment is the sum of all columns Need to decide how to score a single column
4
Scoring function (cont’d)
Ideally: An n-dimensional matrix, where n is the number of sequences E.g. (A, C, C, G, -) for aligning 5 sequences Total number of parameters: (k+1)n, where k is the alphabet size Direct estimation of such scores is difficult Too many parameters to estimate Even more difficult if need to consider phylogenetic relationships x y Phylogenetic tree or evolution tree ? z w v
5
Scoring Function (cont’d)
Compromises: Compute from pair-wise scores Option 1: Based on sum of all pair-wise scores Option 2: Based on scores with a consensus sequence Other options Consider tree topology explicitly Information-theory based score Difficult to optimize
6
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG - - - -
7
Sum Of Pairs (cont’d) The sum-of-pairs (SP) score of an alignment is the sum of the scores of all induced pairwise alignments S(m) = k<l s(mk, ml) s(mk, ml): score of induced alignment (k,l)
8
Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG A C G T - 1 -1
(A,A) + (A,G) x 2 = -1 (G,G) x 3 = 3 (-,A) x 2 + (A,A) = -1 Total score = (-1) (-2) (-2) (-1) + (-1) = 5
9
Sum Of Pairs (cont’d) Drawback: no evolutionary characterization
Every sequence derived from all others Heuristic way to incorporate evolution tree Weighted Sum of Pairs: S(m) = k<l wkl s(mk, ml) wkl: weight decreasing with distance Human Mouse Duck Chicken
10
Consensus score -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Consensus sequence: Find optimal consensus string m* to maximize S(m) = i s(m*, mi) s(mk, ml): score of pairwise alignment (k,l)
11
Multiple Sequence Alignments Algorithms
Can also be global or local We only talk about global for now A simple method Do pairwise alignment between all pairs Combine the pairwise alignments into a single multiple alignment Is this going to work?
12
Compatible pairwise alignments
AAAATTTT AAAATTTT---- ----TTTTGGGG AAAATTTT---- AAAA----GGGG AAAATTTT---- ----TTTTGGGG AAAA----GGGG TTTTGGGG AAAAGGGG ----TTTTGGGG AAAA----GGGG
13
Incompatible pairwise alignments
AAAATTTT AAAATTTT---- ----TTTTGGGG ----AAAATTTT GGGGAAAA---- ? TTTTGGGG GGGGAAAA TTTTGGGG---- ----GGGGAAAA
14
Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsh: Find the longest path in a high-dimensional cube As opposed to a two-dimensional grid Uses a N-dimensional matrix As apposed to a two-dimensional array Entry F(i1, …, ik) represents score of optimal alignment for s1[1..i1], … sk[1..ik] F(i1,i2,…,iN) = max(all neighbors of a cell) (F(nbr)+S(current))
15
Multidimensional Dynamic Programming (MDP)
Example: in 3D (three sequences): 23 – 1 = 7 neighbors/cell (i-1,j-1,k-1) (i-1,j,k-1) (i-1,j-1,k) (i-1,j,k) F(i-1,j-1,k-1) + S(xi, xj, xk), F(i-1,j-1,k ) + S(xi, xj, -), F(i-1,j ,k-1) + S(xi, -, xk), F(i,j,k) = max F(i ,j-1,k-1) + S(-, xj, xk), F(i-1,j ,k ) + S(xi, -, -), F(i ,j-1,k ) + S(-, xj, -), F(i ,j ,k-1) + S(-, -, xk) (i,j-1,k-1) (i,j,k-1) (i,j-1,k) (i,j,k)
16
Multidimensional Dynamic Programming (MDP)
Running Time: Size of matrix: LN; Where L = length of each sequence N = number of sequences Neighbors/cell: 2N – 1 Therefore………………………… O(2N LN)
17
Faster MDP Carrillo & Lipman, 1988 Implemented in a tool called MSA
Branch and bound Other heuristics Implemented in a tool called MSA Practical for about 6 sequences of length about
18
Faster MDP Basic idea: bounds of the optimal score of a multiple alignment can be pre-computed Upper-bound: sum of optimal pair-wise alignment scores, i.e. S(m) = k<l s(mk, ml) k<l s(k, l) lower-bounded: score computed by any approximate algorithm (such as the ones we’ll talk next) For any partial path, if Scurrent + Sperspective < lower-bound, can give up that path Guarantees optimality Optimal msa Score of the alignment between k and l induced by m Score of optimal alignment between k and l
19
Progressive Alignment
Multiple Alignment is NP-hard Most used heuristic: Progressive Alignment Algorithm: Align two of the sequences xi, xj Fix that alignment Align a third sequence xk to the alignment xi,xj Repeat until all sequences are aligned Running Time: O(NL2) Each alignment takes O(L2) Repeat N times
20
Progressive Alignment
x y z w When evolutionary tree is known: Align closest first, in the order of the tree Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw)
21
Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment Algorithm: Find all dij: alignment dist (xi, xj) High alignment score => short distance Construct a tree (similar to hierarchical clustering. Will discuss in future) Align nodes in order of decreasing similarity + a large number of heuristics
22
CLUSTALW example S1 ALSK S2 TNSD S3 NASK S4 NTSD
23
CLUSTALW example S1 ALSK S2 TNSD S3 NASK S4 NTSD s1 s2 s3 s4 9 4 7 8 3
9 4 7 8 3 Distance matrix
24
CLUSTALW example S1 ALSK S2 TNSD S3 NASK S4 NTSD s1 s1 s2 s3 s4 9 4 7
9 4 7 8 3 s3 s2 s4
25
CLUSTALW example S1 ALSK S2 TNSD S3 NASK S4 NTSD s1 s1 s2 s3 s4 9 4 7
9 4 7 8 3 s3 s2 s4
26
CLUSTALW example S1 ALSK S2 TNSD S3 NASK S4 NTSD s1 s1 s2 s3 s4 9 4 7
9 4 7 8 3 s3 s2 s4
27
CLUSTALW example S1 ALSK S2 TNSD S3 NASK S4 NTSD -ALSK -TNSD NA-SK
9 4 7 8 3 s3 s2 s4
28
Iterative Refinement Problems with progressive alignment:
Depend on pair-wise alignments If sequences are very distantly related, much higher likelihood of errors Initial alignments are “frozen” even when new evidence comes Example: x: GAAGTT y: GAC-TT z: GAACTG w: GTACTG Frozen! Now clear: correct y should be GA-CTT
29
Iterative Refinement Algorithm (Barton-Stenberg):
Align most similar xi, xj Align xk most similar to (xixj) Repeat 2 until (x1…xN) are aligned For j = 1 to N, Remove xj, and realign to x1…xj-1xj+1…xN Repeat 4 until convergence Progressive alignment
30
Iterative Refinement (cont’d)
For each sequence y Remove y Realign y (while rest fixed) z x allow y to vary y x,z fixed projection Note: Guaranteed to converge (why?) Running time: O(kNL2), k: number of iterations
31
Iterative Refinement Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA y: GAC-TTA z: GAACTGA w: GTACTGA After realigning y: y: G-ACTTA + 3 matches
32
Iterative Refinement Example not handled well: x: GAAGTTA y1: GAC-TTA
z: GAACTGA w: GTACTGA Realigning any single yi changes nothing
33
Restricted MDP Similar to bounded DP in pair-wise alignment
Construct progressive multiple alignment m Run MDP, restricted to radius R from m z x y Running Time: O(2N RN-1 L)
34
Restricted MDP Within radius 1 of the optimal
x: GAAGTTA y1: GAC-TTA y2: GAC-TTA y3: GAC-TTA z: GAACTGA w: GTACTGA Within radius 1 of the optimal Restricted MDP will fix it.
35
Other approaches Statistical learning methods
Profile Hidden Markov Models Consistency-based methods Still rely on pairwise alignment But consider a third seq when aligning two seqs If block A in seq x aligns to block B in seq y, and both aligns to block C in seq z, we have higher confidence to say that the alignment between A-B is reliable Essentially: change scoring system according to consistency Then apply DP as in other approaches Pioneered by a tool called T-Coffee
36
Multiple alignment tools
Clustal W (Thompson, 1994) Most popular T-Coffee (Notredame, 2000) Another popular tool Consistency-based Slower than clustalW, but generally more accurate for more distantly related sequences MUSCLE (Edgar, 2004) Iterative refinement More efficient than most others DIALIGN (Morgenstern, 1998, 1999, 2005) “local” Align-m (Walle, 2004) PROBCONS (Do, 2004) Probabilistic consistency-based Best accuracy on benchmarks ProDA (Phuong, 2006) Allow repeated and shuffled regions
37
In summary Multiple alignment scoring functions
Sum of pairs Other funcs exist, but less used Multiple alignment algorithms: MDP Optimal too slow Branch & Bound doesn’t solve the problem entirely Progressive alignment: clustalW Iterative refinement Restricted MDP Consistency-based Heuristic
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.