Needleman-Wunsch with affine gaps
Gap score: g(g) = -d - (g-1)e, with gap-open penalty d = 2 and gap-extend penalty e = 1. Precedence: M, Ix, Iy.
PAM 250 (excerpt; the A/D entry is 0 in the full matrix):
     A    C    D
A    2
C   -2   12
D    0   -5    4
Exercise: align the sequences CA and DC.
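A minimal sketch of the affine-gap recursion (Gotoh's three-state formulation) applied to this exercise; the `gotoh` function is an assumption, as is the s(A, D) = 0 entry, which comes from the full PAM250 matrix rather than the slide:

```python
NEG = float("-inf")

# PAM250 excerpt from the slide; s(A, D) = 0 is taken from the full PAM250 matrix.
S = {("A", "A"): 2, ("A", "C"): -2, ("C", "C"): 12,
     ("A", "D"): 0, ("C", "D"): -5, ("D", "D"): 4}

def sub(a, b):
    return S[(min(a, b), max(a, b))]  # the matrix is symmetric

def gotoh(x, y, d=2, e=1):
    """Needleman-Wunsch with affine gaps g(g) = -d - (g-1)e (score only)."""
    n, m = len(x), len(y)
    M  = [[NEG] * (m + 1) for _ in range(n + 1)]  # x[i] aligned to y[j]
    Ix = [[NEG] * (m + 1) for _ in range(n + 1)]  # x[i] aligned to a gap
    Iy = [[NEG] * (m + 1) for _ in range(n + 1)]  # y[j] aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j]  = max(M[i-1][j-1], Ix[i-1][j-1], Iy[i-1][j-1]) + sub(x[i-1], y[j-1])
            Ix[i][j] = max(M[i-1][j] - d, Ix[i-1][j] - e)  # open vs. extend
            Iy[i][j] = max(M[i][j-1] - d, Iy[i][j-1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])

print(gotoh("CA", "DC"))  # best alignment is -CA over DC-: -2 + 12 - 2 = 8
```

The best alignment gaps both ends so that the two C residues pair for +12, which outweighs the two length-1 gaps at -2 each.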
Multiple sequence alignment
Biology 162 Computational Genetics, Todd Vision, 2 Sep 2004
Preview
How to score a multiple alignment: sum-of-pairs scores, weighting
Generalizing pairwise alignment algorithms: full dynamic programming, Carrillo-Lipman
Practical methods: progressive, iterative, stochastic, probabilistic
Some final thoughts
Multiple sequence alignment (MSA)
Mind the gaps
[Figure: example alignments ranging from trivial to difficult]
Natural score: the tree score
Even with a known tree, finding an MSA that optimizes the tree score is NP-hard.
[Figure: phylogenetic tree with sequences A-E at the leaves and pairwise scores S_AE, S_BD, S_DE, S_CD along its branches]
Star-tree scores
Assume an unresolved (star) phylogeny.
Sum-of-pairs: SP(A) = Σ_{i<j} S(A_i, A_j), the sum of the scores of all induced pairwise alignments
Entropy
Consistency: weighs agreement with external evidence
Entropy as used in a sequence logo
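The column entropy behind a sequence logo can be computed directly from residue frequencies; the `column_entropy` helper below is an illustrative assumption, not from the slides:

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# A perfectly conserved column carries 0 bits of entropy; for DNA a logo
# plots the information content 2 - H, so conserved columns get tall stacks.
print(column_entropy("AAAA"))  # 0.0
print(column_entropy("ACGT"))  # 2.0
```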
SP scores: pros and cons
Pros: easy, intuitive, and they work OK in practice
Cons:
Substitution scores are based only on pairs of residues
Inconsistent behavior as the number of sequences k changes: one mismatch matters more when k is large than when k is small
Gap penalties: s(-,-) is undefined
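A sketch of SP scoring that makes the gap issue concrete: since s(-,-) is undefined, the code adopts the common convention s(-,-) = 0. The `sp_score` helper and its default gap value are assumptions:

```python
from itertools import combinations

def sp_score(alignment, sub, gap=-4):
    """Sum-of-pairs score of an MSA given as a list of equal-length rows.
    Convention adopted here: s(x, -) = gap and s(-, -) = 0."""
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue            # s(-,-) = 0 by convention
            elif a == "-" or b == "-":
                score += gap
            else:
                score += sub(a, b)
    return score

match = lambda a, b: 1 if a == b else 0
print(sp_score(["AC-", "A-C"], match, gap=-1))  # 1 - 1 - 1 = -1
```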
Natural gap penalties
The gap cost of a multiple alignment should equal the sum of the gap costs of its induced pairwise alignments. This is computationally prohibitive for most algorithms, so quasi-natural gap costs are computed instead; these are almost always identical to the natural costs.
Weighted SP scores
Pairwise scores are not independent, due to (unaccounted-for) shared ancestry.
To correct this, sum-of-pairs terms from closely related sequences can be down-weighted.
A variety of weighting schemes exist; tree-based weighting is the simplest: assign weights proportional to the sum of branch lengths on a phylogenetic tree.
This obviously requires a tree (but some algorithms provide an approximate one).
Full dynamic programming
We have k sequences of length n. The recursion equations are similar to the pairwise case, a simple extension of pairwise scoring can be used, and, as before, an optimal alignment is guaranteed.
The problem: we must fill out a k-dimensional hypercube, so time and space grow exponentially in k, at least O(k^2 2^k n^k).
This is computationally prohibitive even for a moderate number of short sequences.
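The growth of the hypercube can be made concrete with the operation count above; `msa_dp_cost` is an illustrative helper, not an exact constant-factor analysis:

```python
def msa_dp_cost(k, n):
    """Rough operation count for full k-dimensional DP: O(k^2 * 2^k * n^k).
    Roughly n^k cells, 2^k - 1 predecessor moves per cell, and O(k^2)
    pairwise score terms per move."""
    return k * k * 2**k * n**k

for k in (2, 3, 4, 5):
    print(k, msa_dp_cost(k, 300))  # explodes long before k is "moderate"
```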
Carrillo-Lipman algorithm
Reduces the volume of the hypercube that is searched.
Upper bound on score: the score of the optimal MSA is at most the sum of the scores of the optimal pairwise alignments.
Lower bound on score: the score of the optimal MSA is at least the score of a heuristic MSA.
The projection in each dimension is bounded by the optimal pairwise alignment and the induced heuristic alignment, so the optimal path is bounded by the projections in all dimensions.
Carrillo-Lipman algorithm
Carrillo-Lipman algorithm
Only works for the SP scoring function. Implemented in the MSA software package. Can still only tackle small cases (e.g. 15 sequences of length 300).
Practical global alignment methods
Progressive: uses a guide tree to reduce the problem to multiple pairwise alignments
Iterative: initialized with a fast multiple alignment; sequences are randomly partitioned and pairwise aligned until convergence
Stochastic: genetic algorithms, for example
Probabilistic: hidden Markov models
Progressive alignment
Fast, but with no guarantee of finding the optimum. Implementations: Feng-Doolittle, ClustalW, Pileup.
Steps:
1. Compute all k(k-1)/2 pairwise alignments
2. Use the alignment scores to construct a guide tree
3. Perform pairwise alignments beginning at the leaves of the guide tree and working toward the root
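Step 2 can be sketched with a tiny UPGMA-style clustering from pairwise distances (e.g. derived from alignment scores); real implementations differ (ClustalW uses neighbor joining), and `upgma`, the nested-tuple tree encoding, and the distance input format are all assumptions:

```python
def upgma(dist):
    """Build a guide tree from pairwise distances. `dist` maps (a, b)
    label pairs to distances; the tree is returned as nested tuples."""
    D = {frozenset(pair): d for pair, d in dist.items()}
    size = {leaf: 1 for pair in dist for leaf in pair}
    while len(size) > 1:
        pair = min(D, key=D.get)          # join the two closest clusters
        a, b = sorted(pair, key=str)
        new = (a, b)
        for c in list(size):
            if c in pair:
                continue
            dac = D.pop(frozenset((a, c)))
            dbc = D.pop(frozenset((b, c)))
            # average linkage, weighted by cluster sizes
            D[frozenset((new, c))] = (size[a] * dac + size[b] * dbc) / (size[a] + size[b])
        del D[pair]
        size[new] = size.pop(a) + size.pop(b)
    return next(iter(size))

print(upgma({("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6}))  # (('A', 'B'), 'C')
```

The closest pair (A, B) is joined first, then the merged cluster is joined with C, mirroring the leaves-to-root order in which progressive alignment proceeds.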
Pairwise score matrix
             Seq 2   Seq 3   Seq 4   Seq 5
Seq 1        S12     S13     S14     S15
Seq 2                S23     S24     S25
Seq 3                        S34     S35
Seq 4                                S45
Guide tree
[Figure: guide tree grouping Sequences 1-5]
New problem: how do you align a sequence to an alignment? Or two alignments to each other?
Feng-Doolittle solution: choose the highest-scoring pair of sequences between the two groups to guide the alignment.
ClustalW solution: profile alignment, computing a generalized sum-of-pairs score.
Profiles
[Figure: two profiles, I and II, each a position (1-4) by residue (a, c, g, t) matrix of weights w, aligned to each other]
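Building a profile from a group of already-aligned sequences is straightforward: each column becomes a vector of residue frequencies. The `make_profile` helper is an illustrative assumption:

```python
def make_profile(alignment, alphabet="acgt-"):
    """Column-wise residue frequencies of an MSA (list of equal-length rows)."""
    return [{r: col.count(r) / len(col) for r in alphabet}
            for col in zip(*alignment)]

prof = make_profile(["ac-t", "acgt", "a-gt"])
print(prof[0]["a"])  # 1.0 -- 'a' is fully conserved in the first column
```

Two profiles can then be aligned with the usual dynamic programming, scoring a pair of columns by the expected substitution score under these frequencies.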
ClustalW: ad hoc improvements
Variable substitution matrix
Gaps encouraged preferentially in structural loops: residue-specific gap penalties, with reduced penalties in hydrophilic regions
Reduced gap penalties at positions already containing gaps
Increased gap-opening penalties in the sequence flanking a gap
Progressive alignment: major weakness
Errors introduced in the alignment of subgroups are propagated through all subsequent steps, and there is no provision for correcting such errors once they happen: a local optimum rather than the global optimum.
Iterative alignment
Again capitalizes on the ease of pairwise alignment between groups of sequences. Allows gaps to be removed and positions to be shifted in each iteration. Some algorithms guarantee convergence given enough time. Can be several orders of magnitude slower than progressive methods. Most successful implementation: PRRN.
Iterative alignment
[Figure: iterative realignment of the sequences ACGATAGACAT, ACGTACAGAT, CGATAGAGAC, and CGATACAGAC, with gap placement revised at each iteration]
T-COFFEE
Uses consistency as an objective function: it evaluates consistency with pairs of residues found in optimal local alignments and a heuristic global alignment. The consistency function can also incorporate extraneous information (such as structural constraints). Among the most successful approaches when percent identity is moderate to good.
Dialign
A multiple local alignment algorithm. Informally, it works by chaining together ungapped segments from dotplots. It does not explicitly score gaps at all, and its output may contain unaligned regions flanked by aligned regions.
Stochastic methods
Genetic algorithms (e.g. SAGA):
1. Initialize with a population of heuristic alignments
2. Evaluate the 'fitness' of individual alignments (computationally intensive scoring functions can be employed here)
3. Create a new generation of alignments: select parents according to fitness, 'cross over' attributes of parents, and apply mutation to perturb the progeny alignments
4. Return to the 'evaluate fitness' step until a stopping rule is met
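The cycle above can be sketched generically. This is a toy skeleton of a SAGA-style loop (selection, crossover, mutation), demonstrated on a trivial bit-string fitness rather than real alignments; every name in it is an assumption, not SAGA's actual code:

```python
import random

def genetic_search(population, fitness, crossover, mutate, generations=100, seed=0):
    """Generic GA loop: evaluate fitness, select parents, cross over, mutate."""
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)       # evaluate 'fitness'
        parents = pop[:max(2, len(pop) // 2)]     # select parents by fitness
        children = [mutate(crossover(*rng.sample(parents, 2)), rng)
                    for _ in range(len(pop) - len(parents))]
        pop = parents + children                   # next generation (elitist)
    return max(pop, key=fitness)

# Toy problem standing in for alignment fitness: maximize the number of 1s.
ones = lambda bits: sum(bits)
cross = lambda a, b: a[:len(a) // 2] + b[len(b) // 2:]
def flip(bits, rng):
    bits = list(bits)
    i = rng.randrange(len(bits))
    bits[i] = 1 - bits[i]
    return bits

best = genetic_search([[0] * 8 for _ in range(10)], ones, cross, flip)
```

Because the top half of each generation survives unchanged, the best alignment found so far is never lost, which is one common way to guarantee the search does not regress.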
Probabilistic methods
Hidden Markov models: models that generate MSAs. They have many parameters to fit: the probability of each residue in each column, and the probabilities of entering gap states between columns. They perform poorly on unaligned sequences, but perform well for finding matches to already-aligned sequences and are commonly used in signature databases. Efficient algorithms exist for aligning sequences to HMMs.
Hidden Markov model
How do you know when you've got the right answer?
Short answer: you don't. Structural superposition is typically used to evaluate methodologies. BAliBASE is a database of curated reference alignments.
Comparison of test and reference alignments
Modified SP score: the frequency with which pairs of residues aligned in the test alignment are also aligned in the reference.
Column score: the frequency with which entire columns of residues are aligned identically in both test and reference.
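The modified SP score can be sketched by enumerating the residue pairs each alignment places in the same column and measuring their overlap; the helpers `aligned_pairs` and `sp_agreement` are assumptions for illustration:

```python
from itertools import combinations

def aligned_pairs(aln):
    """All pairs of (sequence, residue-index) placed in the same column."""
    pairs, pos = set(), [0] * len(aln)
    for column in zip(*aln):
        present = []
        for s, c in enumerate(column):
            if c != "-":
                present.append((s, pos[s]))
                pos[s] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sp_agreement(test, reference):
    """Fraction of residue pairs aligned in the reference that the test recovers."""
    t, r = aligned_pairs(test), aligned_pairs(reference)
    return len(t & r) / len(r)

# The reference pairs both columns; the test recovers only the first pairing.
print(sp_agreement(["A-C", "AC-"], ["AC", "AC"]))  # 0.5
```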
Be skeptical!
MSA is a hard problem, both computationally and biologically. There is no 'one size fits all' algorithm, and no two algorithms need agree.
The future of MSA
Chances are your new sequence matches something already in the database; it may soon be a rarity to generate an MSA from scratch. Signature databases currently allow local alignment of a query to a pre-existing local multiple alignment (e.g. InterProScan).
Summary
Challenges in MSA: even bounded dynamic programming is impractical, and appropriate scoring is not obvious.
How MSA is achieved in practice: progressive pairwise alignment (fastest); iterative and stochastic alignment (slower).
Automated MSAs require manual scrutiny.
Reading assignment
Pertsemlidis A, Fondon JW (2002) Having a BLAST with bioinformatics (and avoiding BLASTphemy), 10 pgs.
Reading assignment
Gusfield D (1997), selected pages in Algorithms on Strings, Trees and Sequences. Durbin et al. (1998), pgs. 36-, in Biological Sequence Analysis.