Multiple Sequence Alignment
- Motivation: what are we trying to accomplish?
- Definition
- Scoring (Sum of Pairs scoring)
- Algorithms
- Family Representations
Multiple Sequence Alignment Motivation
- Representation of protein families
- Identification and representation of conserved features of DNA/protein sequences that correlate with structure or function
- Deduction of evolutionary history from DNA/protein sequences
- Read pages 333-342
- A lot of this is done by “heuristic” or “intuition” and is difficult to automate
Biological Motivation
- Previous “First Fact of Biological Sequence Comparison”: in biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity
- Second Fact of Biological Sequence Comparison: evolutionarily and functionally related molecular strings can differ significantly throughout the string yet preserve the same 3D structure(s), 2D substructure(s), active sites, or dispersed residues
2 strings versus multiple strings
- 2 strings
  - Based on first fact
  - Find unknown biological relationships using string similarity
  - Method: database searching
- Multiple strings
  - Based loosely on second fact
  - Given known biological relationships (function, structure, etc.), identify unknown conserved subpatterns in a set of strings
  - These subpatterns can then be used as a known pattern for other database searches
Definition
- A global alignment of a set of k > 2 strings {Si} is obtained by inserting spaces (dashes) into the strings so that they all end up with the same length, and then stacking them so that each column holds one character (or dash) from each string
  - Note: ALL positions of every Si are involved
- A local alignment of a set of k > 2 strings {Si} is obtained by selecting one substring Si’ from each string Si and globally aligning those substrings
Example
Strings {abca, ababa, accb, cbbc}
    a b c - a
    a b a b a
    a c c b -
    c b - b c
Multiple Sequence Alignment
- Motivation
- Definition
- Scoring (Sum of Pairs scoring)
  - Induced pairwise alignments
  - Definition of sum of pairs (SP) scoring
  - Justification (or lack thereof)
- Algorithms
- Family Representations
Scoring MSAs
- Key fact: there is no universally accepted score function
  - My impression is that people evaluate MSAs by feel (they know a good one when they see it)
Definitions
- Given an MSA M, the induced pairwise alignment of Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. Columns in which both rows hold a space (opposing spaces) can be removed if desired.
- The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.
Example
MSA
    a b c - a
    a b a b a
    a c c b -
    c b - b c
Induced alignment (rows 1 and 3)
    a b c - a
    a c c b -
Score (per column)
    0 + 1 + 0 + 1 + 1 = 3
Sum of Pairs (SP)
- Definition: the SP score of an MSA M is the sum of the scores of the pairwise global alignments induced by M
Example
    a b c - a
    a b a b a
    a c c b -
    c b - b c
SP score: 2 + 3 + 4 + 3 + 3 + 4 = 19
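The SP definition can be checked against the example with a short sketch. It assumes the unit-cost scheme that the per-column scores on the example slide suggest (0 for a match or two opposing spaces, 1 otherwise). The function names are illustrative only.

```python
from itertools import combinations

def column_cost(x, y):
    """Assumed unit-cost scheme: 0 for a match (or two spaces), 1 otherwise."""
    return 0 if x == y else 1

def induced_score(row_i, row_j, cost=column_cost):
    """Score of the pairwise alignment induced by two rows of an MSA."""
    return sum(cost(x, y) for x, y in zip(row_i, row_j))

def sp_score(msa, cost=column_cost):
    """Sum of the induced pairwise scores over all pairs of rows."""
    return sum(induced_score(a, b, cost) for a, b in combinations(msa, 2))

msa = ["abc-a", "ababa", "accb-", "cb-bc"]
print([induced_score(a, b) for a, b in combinations(msa, 2)])  # [2, 3, 4, 3, 3, 4]
print(sp_score(msa))                                           # 19
```

Opposing spaces cost 0 under this scheme, so removing them from an induced alignment does not change its score.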
Justification
- Difficult to give a sound biological justification for SP or any other scoring scheme
- Main reasons for studying it
  - It is easy to work with
  - It has been used by many people in studying MSA
  - It is used in several packages
Multiple Sequence Alignment
- Motivation
- Definition
- Scoring (Sum of Pairs scoring)
- Algorithms
  - Exact, NP-hard problem
  - Approximation Algorithm (Center Star)
  - Heuristic Methods
- Family Representations
Formal Problem
- Input
  - k strings {Si}
  - Scoring function
- Output
  - MSA of {Si} with minimum (maximum) SP score
- Observation
  - Exact solution is NP-hard
  - Dynamic programming takes O(n^k) time, so solving exactly for more than about 6 strings of typical length is often not feasible
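To make the cost of the exact approach concrete, here is a minimal sketch of the k-dimensional dynamic program. It assumes unit costs (mismatch or space = 1, match or opposing spaces = 0) and SP minimization. The table has one cell per tuple of prefix lengths, and each cell looks at up to 2^k - 1 predecessor cells. The name `exact_sp` is illustrative.

```python
from itertools import combinations, product

def exact_sp(strings):
    """Minimum SP score over all global alignments, by k-dimensional DP."""
    k = len(strings)
    lengths = tuple(len(s) for s in strings)
    origin = (0,) * k
    # A column either consumes a character from a string (1) or places a space (0).
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    table = {origin: 0}
    # product() yields cells in lexicographic order, so predecessors come first.
    for cell in product(*(range(n + 1) for n in lengths)):
        if cell == origin:
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue
            column = [strings[r][prev[r]] if move[r] else "-" for r in range(k)]
            cost = table[prev] + sum(x != y for x, y in combinations(column, 2))
            if best is None or cost < best:
                best = cost
        table[cell] = best
    return table[lengths]

print(exact_sp(["abca", "ababa"]))                   # 2 (plain edit distance for k = 2)
print(exact_sp(["abca", "ababa", "accb", "cbbc"]))   # optimal SP, at most 19
```

Even for these four short strings the table has 5 * 6 * 5 * 5 = 750 cells, and the cell count grows as n^k, which is why the exact method stops being practical so quickly.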
Heuristic Speedup
- View the problem as a shortest path problem with O(n^k) nodes
- Given an upper bound on the optimal value, we can eliminate exploration of many nodes using branch and bound ideas
- Key is to send values forward rather than backwards
  - Backwards: all nodes will eventually be evaluated
  - Forwards: limit exploration to nodes whose value can possibly be less than the current estimate of the optimum
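The forward idea can be sketched for the two-string case, where the table is small enough to follow by hand. The sketch assumes unit costs and uses the difference in remaining lengths as a lower bound on the cost still to come. Both the bound and the function name are illustrative choices, not taken from the slides. A node is expanded only if its value plus that lower bound does not exceed the supplied upper bound.

```python
def forward_edit_distance(s, t, upper_bound):
    """Forward DP with pruning; returns (distance or None, nodes expanded)."""
    n, m = len(s), len(t)
    best = {(0, 0): 0}           # values that have been "sent forward" so far
    expanded = 0
    for i in range(n + 1):       # row-major order visits predecessors first
        for j in range(m + 1):
            d = best.get((i, j))
            if d is None:
                continue         # never reached: every predecessor was pruned
            if d + abs((n - i) - (m - j)) > upper_bound:
                continue         # cannot lead to a solution within the bound
            expanded += 1
            successors = []
            if i < n:
                successors.append(((i + 1, j), d + 1))
            if j < m:
                successors.append(((i, j + 1), d + 1))
            if i < n and j < m:
                successors.append(((i + 1, j + 1), d + (s[i] != t[j])))
            for node, value in successors:
                if value < best.get(node, float("inf")):
                    best[node] = value
    return best.get((n, m)), expanded

s, t = "writers", "vintner"
distance, expanded = forward_edit_distance(s, t, upper_bound=max(len(s), len(t)))
print(distance, expanded, (len(s) + 1) * (len(t) + 1))
```

Any valid upper bound works; `max(len(s), len(t))` is the trivial one under unit costs. If the bound is below the true optimum the goal may never be reached, and the sketch then returns `None` for the distance.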
[Figure: Backwards. Dynamic programming table D(i,j) for two example strings, evaluated backwards]
[Figure: Forwards. The same table D(i,j), with values sent forwards]
Approximation Algorithms
- Given the hardness of computing the exact solution, how about developing algorithms that compute a solution guaranteed to be close to optimal?
- Goal: find a polynomial-time algorithm A that minimizes sup_I A(I)/OPT(I), the worst-case ratio over all inputs I
- Only computer scientists seem interested in this
- Biologists seem to do things more heuristically
Alignments consistent with a tree
- D(Si,Sj) is the optimal weighted edit distance between Si and Sj
- Definition: let T be a tree where each node is labeled with a string from {Si}. Then a multiple alignment of {Si} is consistent with T if the induced pairwise alignment of Si and Sj has score D(Si,Sj) for each pair of strings (Si, Sj) that are connected by an edge in T.
Example
    -AX-Z
    -A-YZ
    -AXYZ
    --XYZ
    AYXYZ
- All edge alignment scores are optimal
- Others are not, such as AYXYZ with -AXYZ
[Figure: tree with nodes labeled AXYZ, XYZ, AYXYZ]
Theorem
- For any {Si} and any tree T whose nodes are labeled with distinct strings of {Si}, we can efficiently find an MSA M(T) of {Si} that is consistent with T.
Proof
- Incrementally align any two adjacent nodes
- Two aligned gaps have zero cost
- Add gaps as necessary to other already aligned sequences
Example
- Align AXYZ and XYZ
    AXYZ
    -XYZ
- Align AYXYZ and -XYZ
    A-XYZ        -AXYZ
    --XYZ   or   --XYZ
    AYXYZ        AYXYZ
- …
[Figure: tree with nodes labeled AYZ, AXZ, AXYZ, XYZ, AYXYZ]
Triangle Inequality
- Assume an alphabet-weighted scoring scheme s(x,y)
  - x and y could be any character (or a space)
- A scoring scheme satisfies the triangle inequality if for any three characters (including a space) x, y, and z, s(x,z) <= s(x,y) + s(y,z)
- Note: not all scoring schemes used in biology satisfy this triangle inequality property
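Whether a given scheme satisfies the property can be checked by brute force over all triples of symbols. The two schemes below are toy assumptions for illustration: a unit-cost scheme, and one where a mismatch costs 3 but a space costs only 1, so going through a space is cheaper than substituting directly.

```python
from itertools import product

def satisfies_triangle_inequality(alphabet, s):
    """True if s(x,z) <= s(x,y) + s(y,z) for all symbols, including the space."""
    symbols = list(alphabet) + ["-"]
    return all(s(x, z) <= s(x, y) + s(y, z)
               for x, y, z in product(symbols, repeat=3))

def unit_cost(x, y):
    return 0 if x == y else 1

def cheap_gap(x, y):
    """Toy scheme: mismatch costs 3, aligning with a space costs 1."""
    if x == y:
        return 0
    return 1 if "-" in (x, y) else 3

print(satisfies_triangle_inequality("ab", unit_cost))   # True
print(satisfies_triangle_inequality("ab", cheap_gap))   # False: s(a,b)=3 > s(a,-)+s(-,b)=2
```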
Center Star Method
- For {Si}, define Sc to be the string that minimizes Σ_j D(Sc, Sj), the sum over all strings Sj
- Define the center star to be the star where the center node is labeled with Sc
- Define Mc to be an MSA of {Si} that is consistent with the center star
- Define d(Si, Sj) to be the score of the pairwise alignment of Si and Sj induced by Mc
- Denote the score of an alignment M as d(M)
Example
- Center is AXYZ: Σ_j D(AXYZ, Sj) = 4
- Mc before AYXYZ added
    AXYZ
    AX-Z
    A-YZ
    -XYZ
- Mc after AYXYZ added
    A-XYZ
    A-X-Z
    A--YZ
    --XYZ
    AYXYZ
[Figure: center star with center AXYZ and leaves AYZ, AXZ, XYZ, AYXYZ]
Example continued
- Mc after AYXYZ added
    A-XYZ
    A-X-Z
    A--YZ
    --XYZ
    AYXYZ
- d(AYZ,AYXYZ) = 2
- d(Mc) = 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + 2 + 2 = 16
[Figure: center star with center AXYZ and leaves AXZ, AYZ, XYZ, AYXYZ]
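The construction used in this example can be sketched as follows, assuming unit costs. Each string is aligned optimally to the center, and whenever that alignment puts a space into the center, the same space is added to every row already in the MSA. Ties between equally good pairwise alignments are broken arbitrarily, so the rows produced may differ from the Mc shown above while still being consistent with the star. All function names are illustrative.

```python
def align_pair(s, t):
    """Optimal unit-cost global alignment: returns (D, s_aligned, t_aligned)."""
    n, m = len(s), len(t)
    D = [[i + j if i == 0 or j == 0 else 0 for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i-1][j-1] + (s[i-1] != t[j-1]), D[i-1][j] + 1, D[i][j-1] + 1)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:                      # trace back one optimal path
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (s[i-1] != t[j-1]):
            i, j = i - 1, j - 1
            a.append(s[i]); b.append(t[j])
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            i -= 1
            a.append(s[i]); b.append("-")
        else:
            j -= 1
            a.append("-"); b.append(t[j])
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))

def merge(msa, c_aln, s_aln):
    """Add s_aln to msa (row 0 is the gapped center), inserting gaps as needed."""
    cur, rows, new, i, j = msa[0], [[] for _ in msa], [], 0, 0
    while i < len(cur) or j < len(c_aln):
        gap_i = i < len(cur) and cur[i] == "-"
        gap_j = j < len(c_aln) and c_aln[j] == "-"
        if gap_i and not gap_j:                # old gap in center: pad the new row
            for r, row in zip(rows, msa):
                r.append(row[i])
            new.append("-"); i += 1
        elif gap_j and not gap_i:              # new gap in center: pad the old rows
            for r in rows:
                r.append("-")
            new.append(s_aln[j]); j += 1
        else:                                  # same center character, or two gaps
            for r, row in zip(rows, msa):
                r.append(row[i])
            new.append(s_aln[j]); i += 1; j += 1
    return ["".join(r) for r in rows] + ["".join(new)]

def center_star(strings):
    """Returns (index of center, aligned rows in input order)."""
    totals = [sum(align_pair(s, t)[0] for t in strings) for s in strings]
    c = min(range(len(strings)), key=totals.__getitem__)
    order, msa = [c], [strings[c]]
    for idx, s in enumerate(strings):
        if idx != c:
            _, c_aln, s_aln = align_pair(strings[c], s)
            msa = merge(msa, c_aln, s_aln)
            order.append(idx)
    rows = [None] * len(strings)
    for idx, row in zip(order, msa):
        rows[idx] = row
    return c, rows

def induced_score(a, b):
    return sum(x != y for x, y in zip(a, b))

strings = ["AXYZ", "AXZ", "AYZ", "XYZ", "AYXYZ"]
c, rows = center_star(strings)
print(strings[c])
print("\n".join(rows))
```

On these five strings the sketch picks AXYZ as the center (total distance 4), and every row's induced alignment with the center has score D(Sc, Sj), which is what consistency with the star requires.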
Results
- Lemma: assuming the triangle inequality, d(Si, Sj) <= d(Si, Sc) + d(Sc, Sj) = D(Si, Sc) + D(Sc, Sj)
- Definition: let M* be the optimal alignment of {Si} and let d*(Si, Sj) be the score of the pairwise alignment of Si and Sj induced by M*
- Theorem: d(Mc) / d(M*) <= 2(k-1)/k < 2
Proof
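One standard way to fill in the argument is sketched below. It is a reconstruction under the definitions above, not necessarily the derivation given in the lecture. Sums over i ≠ j run over ordered pairs, so each unordered pair is counted twice, and D(S, S) = 0 is assumed. The first line uses the lemma. The second uses the fact that an induced alignment can never score better than the optimal pairwise alignment, together with the choice of Sc as the minimizer of the row sum.

```latex
\begin{align*}
2\,d(M_c) &= \sum_{i \neq j} d(S_i, S_j)
  \le \sum_{i \neq j} \bigl[ D(S_i, S_c) + D(S_c, S_j) \bigr]
  = 2(k-1) \sum_{j} D(S_c, S_j), \\
2\,d(M^*) &= \sum_{i \neq j} d^*(S_i, S_j)
  \ge \sum_{i \neq j} D(S_i, S_j)
  = \sum_{i} \sum_{j} D(S_i, S_j)
  \ge k \sum_{j} D(S_c, S_j), \\
\frac{d(M_c)}{d(M^*)} &\le \frac{2(k-1)}{k} < 2 .
\end{align*}
```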
Weighted SP
- Each induced pairwise score is multiplied by a weight w(i,j)
- Optimal weighted SP can be computed in exponential time (in k) using dynamic programming
- Little is known about approximation of weighted SP
  - Why doesn’t center star give a guaranteed bound here?
Heuristic Techniques
- In practice, people tend to use more heuristic methods with no proven performance guarantees
- Basic idea
  - Do some form of iterative or progressive alignment
  - For example, do an alignment based on a minimum spanning tree of some sort
    - Find the two closest nodes and join them (how should we define closeness?)
    - Then iteratively add the closest non-aligned node to the alignment
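The joining order described above can be sketched on its own, leaving out the alignment step. The sketch assumes that closeness means unit-cost edit distance, which is only one possible answer to the question the slide raises. It starts from the closest pair and then repeatedly adds the unaligned string nearest to anything already chosen, as Prim's minimum spanning tree algorithm would. The function names are illustrative.

```python
from itertools import combinations

def edit_distance(s, t):
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def progressive_order(strings, dist=edit_distance):
    """Order (as indices) in which strings would be joined; needs >= 2 strings."""
    k = len(strings)
    d = {(i, j): dist(strings[i], strings[j]) for i, j in combinations(range(k), 2)}
    def D(i, j):
        return d[(min(i, j), max(i, j))]
    order = list(min(d, key=d.get))            # the two closest strings
    remaining = set(range(k)) - set(order)
    while remaining:
        nxt = min(sorted(remaining), key=lambda r: min(D(r, a) for a in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(progressive_order(["aaaa", "aaab", "bbbb", "abbb"]))  # [0, 1, 3, 2]
```

Ties are broken by input order here. A real progressive aligner would also have to decide how to align each new string against the partial alignment, which this sketch leaves out.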
One method of defining closeness: sd(i,j) scores
- Given a scoring scheme, compute D(Si, Sj)
- 100 times do
  - “Jumble” Si and Sj and compute D(jum(Si), jum(Sj))
- Compute the mean and standard deviation of these 100 jumbled comparisons
- Define sd(i,j) = D(Si, Sj)/standard deviation (no mean?)
- Intuition
  - Strings Si and Sj contain non-random structures (hopefully secondary structure) in common if sd(i,j) is high
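The jumbling procedure can be sketched as below. The pairwise score is passed in as a parameter because the slide leaves the scoring scheme open. The usage example uses longest-common-subsequence length purely as a stand-in similarity score, which is an assumption and not the score from the lecture. The sketch follows the slide's formula literally (observed score divided by the standard deviation) and also returns the mean, so a mean-subtracted variant is easy to compute from the output.

```python
import random
import statistics

def jumble(s, rng):
    """Random permutation of the characters of s (composition preserved)."""
    chars = list(s)
    rng.shuffle(chars)
    return "".join(chars)

def sd_score(s, t, score, trials=100, rng=None):
    """Returns (sd, observed, mean, std) following the slide's definition."""
    rng = rng or random.Random()
    observed = score(s, t)
    jumbled = [score(jumble(s, rng), jumble(t, rng)) for _ in range(trials)]
    mean = statistics.mean(jumbled)
    std = statistics.pstdev(jumbled)
    sd = observed / std if std > 0 else float("inf")
    return sd, observed, mean, std

def lcs_length(s, t):
    """Stand-in similarity score: length of a longest common subsequence."""
    prev = [0] * (len(t) + 1)
    for x in s:
        cur = [0]
        for j, y in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

sd, observed, mean, std = sd_score("ACDEFGHIKL", "ACDQFGHIKM", lcs_length,
                                   rng=random.Random(0))
print(observed, mean, std, sd)
```

Passing a seeded `random.Random` makes the 100 jumbles reproducible. With the default generator the value of sd(i,j) will vary slightly from run to run.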
Multiple Sequence Alignment
- Motivation
- Definition
- Scoring (Sum of Pairs scoring)
- Algorithms
- Family Representations
  - Profiles
  - Regular expressions/motifs
Representation Problem
- Input
  - Family of sequences that typically have a known biological similarity
- Desired output
  - Representation of this family of sequences that reveals any string/sequence similarities that hopefully are related to their biological similarity
Profiles
Strings {abca, ababa, accb, cbbc}
    1 2 3 4 5
    a b c - a
    a b a b a
    a c c b -
    c b - b c
Profile (% of each character per column)
          1    2    3    4    5
    a    75        25        50
    b         75        75
    c    25   25   50        25
    -              25   25   25
Log odds ratio
Strings {abca, ababa, accb, cbbc}
    a b c - a
    a b a b a
    a c c b -
    c b - b c
Profile (% of each character per column)
          1    2    3    4    5
    a    75        25        50
    b         75        75
    c    25   25   50        25
    -              25   25   25
- p(a) = 6/20 = 30% (frequency of a over the whole alignment)
- p(a,1) = 3/4 = 75% (frequency of a in column 1)
- The entry for character x in column j is log (p(x,j)/p(x))
Example (without logs)
          1     2     3     4     5
    a    2.5   0     .83   0     1.7
    b    0     2.5   0     2.5   0
    c    1     1     2     0     1
    -    0     0     1.7   1.7   1.7
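Both tables can be reproduced from the alignment with a short sketch (the function names are illustrative). Like the example, it stops at the ratio p(x,j)/p(x). Taking the logarithm is one extra `math.log` call per entry, but entries for characters that never occur in a column have ratio 0, whose log is undefined, so they would need separate handling.

```python
from collections import Counter

def profile(msa):
    """Per-column character frequencies p(x,j); absent characters are omitted."""
    k = len(msa)
    return [{x: c / k for x, c in Counter(col).items()} for col in zip(*msa)]

def background(msa):
    """Overall frequency p(x) of each character (including '-') in the alignment."""
    counts = Counter("".join(msa))
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

def odds_ratios(msa):
    """p(x,j) / p(x) for every character present in column j."""
    p = background(msa)
    return [{x: f / p[x] for x, f in col.items()} for col in profile(msa)]

msa = ["abc-a", "ababa", "accb-", "cb-bc"]
for j, col in enumerate(odds_ratios(msa), 1):
    print(j, {x: round(r, 2) for x, r in sorted(col.items())})
```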
Nice feature of profiles
- Natural extension of alignment and scoring of strings to profiles
- Aligning a string to a profile
  - We can generalize notions of pairwise string alignment
  - Scoring: compute a weighted sum based on the frequency of characters in the column
  - Can generalize to profile-to-profile alignments
- Optimal alignment
  - Dynamic programming can solve it
Signature representations
- Signature or motif
  - Signature pattern contained as a substring in most members of a family
  - Typically represented as a regular expression
- Such a regular expression might be derived given a multiple sequence alignment