UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology

Bioinformatics Algorithms and Data Structures
Chapter 14.1-5: Multiple String Comparisons
Lecturer: Dr. Rose
Slides by: Dr. Rose
March 4, 2003

Multiple String Comparisons
Q: Why are we interested in multiple string comparisons?
A: At one level we are data mining.
1. Looking for similarities that indicate
   a) Common evolution
   b) Common functionality
2. The significance of a similarity may not be clear with only two strings.
Multiple string comparison is accomplished by multiple alignment.

Multiple String Comparisons
Defn. A global multiple alignment of k > 2 strings is:
1. A generalization of the alignment of 2 strings.
2. Strings $S_1, S_2, \ldots, S_k$ are inflated with spaces to give strings $S'_1, S'_2, \ldots, S'_k$ of uniform length $l$.
3. The inflated strings are arrayed in k rows of $l$ columns.

Example (five strings arrayed in 14 columns; "." marks an inserted space)
    AGT..CTT.ACGCG
    AGTAGCTT...GCG
    ..TAGC.T..GGCG
    .CTA.C.TAACCCG
    ACTA...TAAC...

Multiple String Comparisons
Consider the relation between two-string comparison and biological function:
- Two-string alignments are used to find unsuspected biological relationships from apparent string similarity.
- This follows from the first fact of biological sequence comparison: sequence similarity implies functional or structural similarity.

Multiple String Comparisons
Consider the relation between multiple string comparison and biological function:
- Multiple string alignments are used to find unknown string similarities from known biological relationships.
- This isn't as obvious, since there is a tendency to focus on one-dimensional sequences rather than the corresponding three-dimensional structures or two-dimensional substructures.

Multiple String Comparisons
This follows from the second fact of biological sequences: strings that are functionally related can appear very different and yet preserve the same important three-dimensional and two-dimensional features.
There are several levels of abstraction entailed:
1. Three-dimensional structure
2. Functionality
3. Amino-acid sequence

Multiple String Comparisons
These different levels of abstraction are preserved/conserved to different degrees:
1. Three-dimensional structure is most preserved
2. Functionality is somewhat conserved
3. Amino-acid sequence is less likely to be conserved
Q: What point are we trying to make?
A: The significance is that similarity of structure may not be blatantly apparent at the sequence level.
⇒ Comparison of multiple sequences highlights less apparent similarity.

Multiple String Comparisons
Example from text: Hemoglobin
- 4 chains of ~140 amino acids apiece
- Found in organisms from insects to mammals
- Insects and vertebrates diverged ~600 million years before present (BP)
⇒ large number of amino acid mutations (~100) per chain between the two sequences (insect & vertebrate)

Multiple String Comparisons
Comparison of two mammalian hemoglobin sequences:
- Exhibits high amino-acid similarity (our cousin the chimpanzee shares the identical sequence)
- Suggests similar functionality
Comparison of mammalian and insect hemoglobin sequences:
- Exhibits little amino-acid similarity
- However, the proteins have similar functionality

Multiple String Comparisons
The important point is that while:
    sequence similarity ⇒ functional & structural similarity
the converse:
    functional & structural similarity ⇒ sequence similarity
is not true, i.e.,
    functional & structural similarity ⇏ sequence similarity

Family & Superfamily Representation
Data Mining Problem:
- Given a set of biologically similar strings, find the commonalities that characterize the family.
Why would we want to do this?
- Conserved features may explain function & structure.
- Characterization of the family may make it easy to recognize new members.
- Characterization may also make it easier to exclude nonmembers.

Family & Superfamily Representation
Example: protein families
- The similarity may be in functionality, or
- in two- or three-dimensional structure
Specific examples:
- Globins (hemoglobins, myoglobins)
- Immunoglobulin (antibody) proteins

Family & Superfamily Representation
Q: Why would we be interested in identifying the family to which a protein belongs?
A: Family membership immediately clues us in on:
1. Physical structure
2. Biological functionality
The text suggests there are ~100,000 proteins in humans but only ~1000 or fewer protein families.

Family & Superfamily Representation
Q: If we suspect that a new protein belongs to some family, how do we check?
1. Align the new protein sequence with a representative member of the family?
2. Align the new protein sequence with several representative members of the family?
3. Align the new protein sequence with a generalization of members of the family?
A: Align the new protein sequence with a generalization of members of the family.

Family & Superfamily Representation
Q: What is the representation of the generalization of members of the family?
Consider:
1. We want to match family members while
2. Excluding non-family members
This is an established area in machine learning. In general, the key is that the representation language must be sufficiently expressive to distinguish between positive and negative examples.
Conjecture: amino acid strings lack sufficient expressiveness.

Family & Superfamily Representation
Three representations in common use:
1. Profile (based on multiple alignment)
2. Consensus sequence (based on multiple alignment)
3. Signature (some based on multiple alignment, some not)

Profile Representation
Defn. A profile (aka weight matrix) for a multiple alignment specifies the frequency of each character in each column.
Consider the following multiple alignment:
    a b c - a
    a b a b a
    a c c b -
    c b - b c
The corresponding extracted profile (blank entries are 0):
         C1    C2    C3    C4    C5
    a   .75         .25         .50
    b         .75         .75
    c   .25   .25   .50         .25
    -               .25   .25   .25
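The extraction of a profile from an alignment can be sketched in a few lines of Python. This is only an illustration of the definition above: the function name `build_profile` is made up here, and `-` is assumed to stand for a space in the alignment.

```python
from collections import Counter

def build_profile(rows):
    """Return one {character: frequency} dict per alignment column."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of a multiple alignment must have the same length")
    k = len(rows)
    profile = []
    for column in zip(*rows):          # walk the alignment column by column
        counts = Counter(column)
        profile.append({ch: n / k for ch, n in counts.items()})
    return profile

# The four-row alignment from the slide; '-' stands for a space.
alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
profile = build_profile(alignment)
for j, col in enumerate(profile, start=1):
    print(f"C{j}", dict(sorted(col.items())))
```

Characters that never occur in a column are simply absent from that column's dictionary, which corresponds to the blank (zero) cells of the profile table.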

Profile Representation
Log-odds ratios: profile entries are sometimes expressed in this form.
Let $p(y, j)$ denote the frequency of character y in column j.
Let $p(y)$ denote the frequency of character y anywhere in the multiply aligned sequences.
Then $\log \frac{p(y, j)}{p(y)}$ is the log-odds ratio for cell (y, j) of the profile (weight matrix).
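As a rough illustration of the log-odds form, the sketch below computes log p(y, j)/p(y) for a small alignment. Several choices here are assumptions, not taken from the slides: the natural logarithm is used, spaces (`-`) are counted like any other character when estimating p(y), and cells with p(y, j) = 0 are left out because their logarithm is undefined. The name `log_odds_profile` is hypothetical.

```python
import math
from collections import Counter

def log_odds_profile(rows):
    """Return one {character: log(p(y, j) / p(y))} dict per column."""
    k = len(rows)
    total = k * len(rows[0])
    background = Counter("".join(rows))      # counts behind p(y)
    table = []
    for column in zip(*rows):
        counts = Counter(column)             # counts behind p(y, j)
        table.append({
            y: math.log((c / k) / (background[y] / total))
            for y, c in counts.items()       # zero-frequency cells are omitted
        })
    return table

# Example alignment of four rows and five columns; '-' stands for a space.
alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
table = log_odds_profile(alignment)
print(round(table[0]["a"], 3))   # log-odds of 'a' in column 1
```

A positive entry means the character is over-represented in that column relative to the alignment as a whole; a negative entry means it is under-represented.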

Profile Representation
Alignment of string S with profile P
- Insertion of spaces into S is allowed
- Use regular string alignment?
- Let C be the string of profile column positions
- Align S with C by inserting spaces into S and C.

Profile Representation
Example: S = aabbc, P is the profile from the previous slide:
         C1    C2    C3    C4    C5
    a   .75         .25         .50
    b         .75         .75
    c   .25   .25   .50         .25
    -               .25   .25   .25
Alignment of S and C:
    S : a a b - b c
    C : 1 - 2 3 4 5
Q: How do we score such an alignment?

Profile Representation
Q: How do we score profile alignments?
1. Assume we have an alphabet-weight scoring scheme, e.g.,
          a    b    c    -
     a    2   -1   -3   -1
     b   -1    2   -1   -1
     c   -3   -1    2   -1
     -   -1   -1   -1    0
2. Column score: compute the weighted sum of scores based on the frequency of characters in the column.
3. Alignment score: sum the column scores.

Profile Representation
Alphabet-weight scoring scheme:
          a    b    c    -
     a    2   -1   -3   -1
     b   -1    2   -1   -1
     c   -3   -1    2   -1
     -   -1   -1   -1    0
Profile:
         C1    C2    C3    C4    C5
    a   .75         .25         .50
    b         .75         .75
    c   .25   .25   .50         .25
    -               .25   .25   .25
Alignment:
    S : a a b - b c
    C : 1 - 2 3 4 5
Compute the weighted sum of scores based on the frequency of characters in the column:
    Column1 = 0.75 * 2 + 0.25 * (-3) = 0.75
    Column2 = 0.75 * 2 + 0.25 * (-1) = 1.25
    Column3 = 0.25 * 0 + 0.50 * (-1) + 0.25 * (-1) = -0.75
    Column4 = 0.75 * 2 + 0.25 * (-1) = 1.25
    Column5 = 0.50 * (-3) + 0.25 * 2 + 0.25 * (-1) = -1.25
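The column scores above can be checked with a short Python sketch. The slide lists only the five profile columns; the second alignment column, where a in S faces a space in C, is scored here as s(a, -) = -1, following the gap-in-profile term of the recurrence given later in the deck. That treatment, and the names `SCORE`, `PROFILE`, and `column_score`, are assumptions made for illustration.

```python
# Alphabet-weight scoring scheme s(x, y); '-' stands for a space.
SCORE = {
    "a": {"a": 2, "b": -1, "c": -3, "-": -1},
    "b": {"a": -1, "b": 2, "c": -1, "-": -1},
    "c": {"a": -3, "b": -1, "c": 2, "-": -1},
    "-": {"a": -1, "b": -1, "c": -1, "-": 0},
}

# Profile columns C1..C5 as {character: frequency}; absent characters have frequency 0.
PROFILE = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]

def column_score(x, j):
    """Frequency-weighted score of character x against profile column j (1-based)."""
    return sum(SCORE[x][y] * p for y, p in PROFILE[j - 1].items())

aligned_S = "aab-bc"
aligned_C = [1, None, 2, 3, 4, 5]   # None marks a space inserted into C

total = 0.0
for x, j in zip(aligned_S, aligned_C):
    # A character of S opposite a space in C is scored with s(x, '-').
    value = SCORE[x]["-"] if j is None else column_score(x, j)
    print(x, j, value)
    total += value
print("alignment score:", total)
```

Summing only the five profile columns gives 1.25; including the extra column for the unmatched a lowers the total by 1.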

Profile Representation
Q: How do we find optimal alignments?
A: Use dynamic programming to maximize similarity.
As before:
- $s(x, y)$ denotes the alphabet-weight score for aligning x with y.
- $p(y, j)$ denotes the frequency of letter y in column j.
Then let $S(x, j) = \sum_{y} s(x, y)\, p(y, j)$ denote the score for aligning x with column j.

Profile Representation
Defn. Let $V(i, j)$ denote the value of the optimal alignment of $S[1..i]$ with the first j columns of C. Then
$V(0, j) = \sum_{k \le j} S(\_, k)$
and
$V(i, 0) = \sum_{k \le i} s(S_1(k), \_)$
Here $S_1(k)$ denotes the k-th character of the first string argument, i.e., $S[k]$.

Profile Representation
The general recurrence is then:
$$V(i, j) = \max \begin{cases} V(i-1, j-1) + S(S_1(i), j) & \text{match the } i\text{th and } j\text{th letters} \\ V(i-1, j) + s(S_1(i), \_) & \text{insert a gap in the profile} \\ V(i, j-1) + S(\_, j) & \text{insert a gap in } S_1 \end{cases}$$
Q: What is the time complexity for solving this recurrence using DP?
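The base cases and the recurrence translate directly into a table-filling routine. The following Python sketch is one straightforward way to do it, not a tuned implementation; the function name `align_to_profile` and the dictionary-based encodings of the scoring scheme and the profile are choices made here for illustration, and the example data are the scoring scheme and profile from the earlier slides.

```python
def align_to_profile(S, profile, s, gap="-"):
    """Value V(n, m) of an optimal alignment of string S with a profile of m columns."""
    n, m = len(S), len(profile)

    def col(x, j):
        # S(x, j): sum over y of s(x, y) * p(y, j), with j 1-based
        return sum(s[x][y] * p for y, p in profile[j - 1].items())

    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):                      # V(0, j): spaces against columns 1..j
        V[0][j] = V[0][j - 1] + col(gap, j)
    for i in range(1, n + 1):                      # V(i, 0): S[1..i] against spaces
        V[i][0] = V[i - 1][0] + s[S[i - 1]][gap]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x = S[i - 1]
            V[i][j] = max(
                V[i - 1][j - 1] + col(x, j),       # match i-th letter with j-th column
                V[i - 1][j] + s[x][gap],           # insert a gap in the profile
                V[i][j - 1] + col(gap, j),         # insert a gap in S
            )
    return V[n][m]

SCORE = {
    "a": {"a": 2, "b": -1, "c": -3, "-": -1},
    "b": {"a": -1, "b": 2, "c": -1, "-": -1},
    "c": {"a": -3, "b": -1, "c": 2, "-": -1},
    "-": {"a": -1, "b": -1, "c": -1, "-": 0},
}
PROFILE = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]
print(align_to_profile("aabbc", PROFILE, SCORE))
```

Each cell evaluates `col` at most twice, and each evaluation touches at most one entry per alphabet character, so the work per cell grows with the alphabet size. Precomputing S(x, j) for every character and column would avoid recomputation.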

Profile Representation
Clearly the time complexity is $O(\sigma m n)$ for DP, where:
1. n is the length of the string S.
2. m is the length of the profile and $\sigma$ is the size of the alphabet.
$O(\sigma m n)$ is more costly than sequence-to-sequence alignment. (Do you recall what that cost was?)

Signature Representation
This representation is used by protein databases such as:
- PROSITE
- BLOCKS
The core idea is that families of proteins are characterized by motifs or sequence signatures.
Q: What is a motif?
A: (Webster) A usu. repeating salient thematic element

Signature Representation
Example from text:
    [&H][&A]D[DE]$x_n$[TSN]$x_4$[QK]G$x_7$[&A]
Where
1. A bracket indicates alternative amino acids.
2. & = {I, L, V, M, F, Y, W}
3. x denotes any amino acid.
4. The subscript denotes the length of the string; n denotes an arbitrary length.

Signature Representation
Example from text:
    [&H][&A]D[DE]$x_n$[TSN]$x_4$[QK]G$x_7$[&A]
Observations:
1. The representation is a generalization.
2. The generalization is a regular expression.

Signature Representation
Signature:
    [&H][&A]D[DE]$x_n$[TSN]$x_4$[QK]G$x_7$[&A]
Matches:
    HADDITIIIIQGIIIIIIIA
    IADDITIIIIQGIIIIIIIA
    LADDITIIIIQGIIIIIIIA
    VADDITIIIIQGIIIIIIIA
    MADDITIIIIQGIIIIIIIA
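The signature can be tried out with an ordinary regular-expression engine. The sketch below is one possible translation into Python's `re` syntax, not the pattern format that PROSITE or BLOCKS themselves use. It assumes that the arbitrary-length run of x may be empty (`.*`), and it lets `.` stand for "any amino acid" without checking that the character is a valid residue code. The names `AMP` and `SIGNATURE` are made up for illustration.

```python
import re

AMP = "ILVMFYW"  # the residue class written '&' on the slide

SIGNATURE = re.compile(
    rf"[{AMP}H][{AMP}A]D[DE]"   # [&H][&A]D[DE]
    r".*"                        # x_n: arbitrary length (assumed to allow zero)
    r"[TSN].{4}"                 # [TSN] followed by x_4
    r"[QK]G.{7}"                 # [QK]G followed by x_7
    rf"[{AMP}A]"                 # [&A]
)

matches = [
    "HADDITIIIIQGIIIIIIIA",
    "IADDITIIIIQGIIIIIIIA",
    "LADDITIIIIQGIIIIIIIA",
    "VADDITIIIIQGIIIIIIIA",
    "MADDITIIIIQGIIIIIIIA",
]
for seq in matches:
    print(seq, bool(SIGNATURE.fullmatch(seq)))
```

`fullmatch` requires the whole string to fit the signature; `SIGNATURE.search` would be the natural choice for locating the motif inside a longer protein sequence.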

Signature Representation
The representation is a regular expression
⇒ Use regular expression pattern matching.
⇒ No need to worry about mismatches/errors.

Computing Multiple Alignments
Recall: two-string local alignment was defined in terms of global alignment of substrings.
⇒ We take the same approach for multiple string local alignment.
Defn. A local multiple alignment of a set S of strings is obtained by selecting one substring $S'_i$ from each $S_i \in S$ and then globally aligning these substrings.

Computing Multiple Alignments
Q: Global vs. local alignment: which should we prefer?
(Wait for someone to respond!)
Gusfield notes that for both
- pairs of sequences and
- multiple sequences
there are biological justifications for preferring local over global alignment.
But...

Computing Multiple Alignments
But...
The best (computer science) theoretical results are for global alignment.
Like the joke about the lost wallet, Gusfield chooses to emphasize global alignment.

Computing Multiple Alignments
Q: How can we generalize the concept of score to multiple alignments? In other words, what objective function should we use?
We will consider three types of objective functions:
1. Sum-of-pairs
2. Consensus
3. Tree

Computing Multiple Alignments
First we define the concept of an induced pairwise alignment and its corresponding score.
Defn. The induced pairwise alignment of strings $S_i$ and $S_j$ is obtained from the global multiple alignment M by removing all rows other than those for $S_i$ and $S_j$.
Note: columns in which both rows have a space can be removed from the induced alignment.
Note: any two-string alignment scoring scheme can be used to score an induced pairwise alignment.

Computing Multiple Alignments
Consider the following pairwise scoring scheme:
    score = #mismatches + #spaces
In the following example:
    1  A A T - G G T T T
    2  A A - C G T T A T
    3  T A T C G - A A T
score(1,2) = 4
score(1,3) = 5
score(2,3) = 4
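The three induced pairwise scores can be reproduced with a small Python sketch. The helper name `induced_pair_score` is made up for illustration. The final line adds the three scores together, which is the usual meaning of the sum-of-pairs objective listed earlier; the deck itself has not yet defined that objective at this point, so treat the total as a preview rather than something stated on this slide.

```python
from itertools import combinations

def induced_pair_score(row_i, row_j, space="-"):
    """#mismatches + #spaces in the pairwise alignment induced by two rows."""
    score = 0
    for x, y in zip(row_i, row_j):
        if x == space and y == space:
            continue                  # matching spaces are dropped from the induced alignment
        if x == space or y == space:
            score += 1                # a character opposite a space
        elif x != y:
            score += 1                # a mismatch
    return score

rows = {1: "AAT-GGTTT", 2: "AA-CGTTAT", 3: "TATCG-AAT"}
pair_scores = {
    (i, j): induced_pair_score(rows[i], rows[j])
    for i, j in combinations(sorted(rows), 2)
}
for pair, value in pair_scores.items():
    print(pair, value)
print("sum over all pairs:", sum(pair_scores.values()))
```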
