Sequence Alignment
Sequences
Much of bioinformatics involves sequences:
- DNA sequences
- RNA sequences
- Protein sequences
We can think of these sequences as strings of letters:
- DNA and RNA: alphabet of 4 letters
- Protein: alphabet of 20 letters

20 Amino Acids
- Glycine (G, GLY)
- Alanine (A, ALA)
- Valine (V, VAL)
- Leucine (L, LEU)
- Isoleucine (I, ILE)
- Phenylalanine (F, PHE)
- Proline (P, PRO)
- Serine (S, SER)
- Threonine (T, THR)
- Cysteine (C, CYS)
- Methionine (M, MET)
- Tryptophan (W, TRP)
- Tyrosine (Y, TYR)
- Asparagine (N, ASN)
- Glutamine (Q, GLN)
- Aspartic acid (D, ASP)
- Glutamic acid (E, GLU)
- Lysine (K, LYS)
- Arginine (R, ARG)
- Histidine (H, HIS)
Codons:
- START: AUG
- STOP: UAA, UAG, UGA

Sequence Comparison
Finding similarity between sequences is important for many biological questions. For example:
- Find genes/proteins with a common origin: this allows us to predict function and structure
- Locate common subsequences in genes/proteins: this identifies common "motifs"
- Locate sequences that might overlap: this helps in sequence assembly

Sequence Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
- GCGCATGGATTGAGCGA
- TGCGCCATTGATGACCA
A possible alignment:
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

Alignments
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
Three elements:
- Perfect matches
- Mismatches
- Insertions and deletions (indels)

Choosing Alignments
There are many possible alignments. For example, compare:
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
to
    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--
Which one is better?

Scoring Alignments
Rough intuition:
- Similar sequences evolved from a common ancestor
- Evolution changed the sequences from this ancestral sequence by mutations:
    - Replacement: one letter replaced by another
    - Deletion: deletion of a letter
    - Insertion: insertion of a letter
- Scoring of sequence similarity should examine how many operations took place

Simple Scoring Rule
Score each position independently:
- Match: +1
- Mismatch: -1
- Indel: -2
The score of an alignment is the sum of its positional scores.

Example
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
Score: (+1 x 13) + (-1 x 2) + (-2 x 4) = 3

    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--
Score: (+1 x 5) + (-1 x 6) + (-2 x 12) = -25

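The simple scoring rule can be checked mechanically. The following is a minimal sketch, assuming gaps are written as "-" and both rows have already been padded to the same length; the function name `score_alignment` and its parameter names are illustrative and do not come from the slides.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Sum the positional scores of two gapped rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += indel      # insertion or deletion
        elif a == b:
            total += match      # perfect match
        else:
            total += mismatch   # replacement
    return total


# First alignment from the example: 13 matches, 2 mismatches, 4 indels.
print(score_alignment("-GCGC-ATGGATTGAGCGA",
                      "TGCGCCATTGAT-GACC-A"))  # 3
```

Because every column is scored independently, the total is just a sum over columns, which is the property the dynamic programming approach relies on later.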
More General Scores
- The choice of +1, -1, and -2 scores was quite arbitrary
- Depending on the context, some changes are more plausible than others:
    - Exchange of an amino acid by one with similar properties (size, charge, etc.)
    vs.
    - Exchange of an amino acid by one with opposite properties

For proteins
Additive Scoring Rules
- We define a scoring function by specifying a function $\sigma$:
    - $\sigma(x, y)$ is the score of replacing x by y
    - $\sigma(x, -)$ is the score of deleting x
    - $\sigma(-, x)$ is the score of inserting x
- The score of an alignment is the sum of position scores

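An additive rule can be sketched by passing the scoring function in as a parameter. In the sketch below, `simple_sigma` reproduces the +1/-1/-2 rule, while `toy_protein_sigma` and its `CHARGE_CLASS` table are hypothetical illustrations of "similar properties score better"; they are not a published substitution matrix and the numbers are arbitrary.

```python
GAP = "-"

# Hypothetical residue classes, for illustration only.
CHARGE_CLASS = {"D": "negative", "E": "negative",
                "K": "positive", "R": "positive"}


def simple_sigma(x: str, y: str) -> int:
    """The +1 / -1 / -2 rule expressed as a scoring function."""
    if x == GAP or y == GAP:
        return -2
    return 1 if x == y else -1


def toy_protein_sigma(x: str, y: str) -> int:
    """Assumed scores: identical +2, same charge class 0, otherwise -1, gap -2."""
    if x == GAP or y == GAP:
        return -2
    if x == y:
        return 2
    same_class = x in CHARGE_CLASS and CHARGE_CLASS.get(x) == CHARGE_CLASS.get(y)
    return 0 if same_class else -1


def alignment_score(top: str, bottom: str, sigma) -> int:
    """Additive score: the sum of sigma over the alignment columns."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return sum(sigma(x, y) for x, y in zip(top, bottom))


print(alignment_score("KD", "RE", simple_sigma))       # -2 (two mismatches)
print(alignment_score("KD", "RE", toy_protein_sigma))  # 0 (two same-charge swaps)
```

Only the scoring function changes between the two calls; the summation over columns stays the same.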
Edit Distance
- The edit distance between two sequences is the "cost" of the "cheapest" set of edit operations needed to transform one sequence into the other
- Computing the edit distance between two sequences is almost equivalent to finding the alignment that minimizes the distance

Computing Edit Distance
- How can we compute the edit distance? If |s| = n and |t| = m, there are more than $\binom{m+n}{m}$ alignments, so enumerating them all is infeasible
- The additive form of the score allows us to use dynamic programming to compute the edit distance efficiently

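To get a feel for how quickly the number of alignments grows, the sketch below counts them exactly. It assumes an alignment is any sequence of columns of the form (letter, letter), (letter, gap), or (gap, letter), and compares the count with the binomial coefficient C(n+m, m), which is a standard lower bound. The function name `count_alignments` is illustrative.

```python
from math import comb


def count_alignments(n: int, m: int) -> int:
    """Count alignments of sequences of lengths n and m.

    The last column is a replacement, a deletion, or an insertion, so
    N[i][j] = N[i-1][j-1] + N[i-1][j] + N[i][j-1], with N[i][0] = N[0][j] = 1.
    """
    N = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            N[i][j] = N[i - 1][j - 1] + N[i - 1][j] + N[i][j - 1]
    return N[n][m]


print(count_alignments(4, 3), comb(4 + 3, 3))  # 129 35
print(count_alignments(17, 17) > comb(34, 17))  # True
```

Even for two 17-letter sequences the count is far too large to enumerate, which is why the additive structure of the score matters.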
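Recursive Argument
Define the notation: $V[i, j] = d(s[1..i], t[1..j])$, the score of the best alignment of the first i letters of s with the first j letters of t, where $\sigma$ is the scoring function.
The last column of such an alignment is a replacement, a deletion, or an insertion. Using this recursive argument, we get the following recurrence for V:

$$V[i+1, j+1] = \max \begin{cases} V[i, j] + \sigma(s[i+1], t[j+1]) \\ V[i, j+1] + \sigma(s[i+1], -) \\ V[i+1, j] + \sigma(-, t[j+1]) \end{cases}$$

Of course, we also need to handle the base cases in the recursion:

$$V[0, 0] = 0, \qquad V[i+1, 0] = V[i, 0] + \sigma(s[i+1], -), \qquad V[0, j+1] = V[0, j] + \sigma(-, t[j+1])$$
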
Dynamic Programming Algorithm
We fill the matrix using the recurrence rule.
Conclusion: d(AAAC, AGC) = -1

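The matrix fill can be sketched as follows, assuming the simple +1/-1/-2 scores from earlier. Rows correspond to prefixes of s and columns to prefixes of t; the function name `global_alignment_matrix` is illustrative.

```python
def global_alignment_matrix(s: str, t: str,
                            match: int = 1, mismatch: int = -1,
                            indel: int = -2) -> list[list[int]]:
    """Return V where V[i][j] is the best score for s[:i] against t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned against the empty string is all indels.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    # Each cell takes the best of the three possible last columns.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,   # replace / match
                          V[i - 1][j] + indel,     # delete s[i]
                          V[i][j - 1] + indel)     # insert t[j]
    return V


V = global_alignment_matrix("AAAC", "AGC")
for row in V:
    print(row)
print(V[-1][-1])  # -1
```

The fill visits each of the (n+1)(m+1) cells once, so the running time is proportional to n times m rather than to the number of alignments.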
Reconstructing the Best Alignment
- To reconstruct the best alignment, we record which case in the recursive rule maximized the score
- We then trace back the path that corresponds to the best alignment:
    AAAC
    AG-C
- Sometimes, more than one alignment has the best score:
    AAAC
    A-GC

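A traceback can be sketched as below, again assuming the +1/-1/-2 scores. Instead of storing which case won while filling the matrix, this version re-checks the three cases at each cell during the walk back, which is equivalent and lets it follow every tie. The function name `all_best_alignments` is illustrative.

```python
def all_best_alignments(s: str, t: str,
                        match: int = 1, mismatch: int = -1, indel: int = -2):
    """Return (best score, list of all alignments that achieve it)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)

    def walk(i: int, j: int):
        """Yield every optimal alignment of s[:i] and t[:j]."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sub:      # diagonal step
                for a, b in walk(i - 1, j - 1):
                    yield a + s[i - 1], b + t[j - 1]
        if i > 0 and V[i][j] == V[i - 1][j] + indel:  # gap in t
            for a, b in walk(i - 1, j):
                yield a + s[i - 1], b + "-"
        if j > 0 and V[i][j] == V[i][j - 1] + indel:  # gap in s
            for a, b in walk(i, j - 1):
                yield a + "-", b + t[j - 1]

    return V[n][m], list(walk(n, m))


score, alignments = all_best_alignments("AAAC", "AGC")
print(score)
for top, bottom in alignments:
    print(top, bottom)
```

For AAAC and AGC this reports a best score of -1 and three tied alignments: the two shown above (AG-C and A-GC) and also -AGC. Enumerating every tie can be exponential in general, so a practical implementation usually returns just one path.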
Local Alignment
Consider now a different question: can we find similar substrings of s and t?
Formally, given s[1..n] and t[1..m], find i, j, k, and l such that d(s[i..j], t[k..l]) is maximal.
- As before, we use dynamic programming
- We now want V[i,j] to record the best alignment of a suffix of s[1..i] and a suffix of t[1..j]
- How should we change the recurrence rule?
New option:
- We can start a new match instead of extending the previous alignment (an alignment of empty suffixes)

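Written out, the modified recurrence could look as follows. This is a reconstruction under the assumption that $\sigma$ denotes the additive scoring function and that aligning two empty suffixes scores 0; the extra 0 in the maximum is the "start a new match" option.

```latex
V[i+1, j+1] = \max
\begin{cases}
0 \\
V[i, j] + \sigma(s[i+1], t[j+1]) \\
V[i, j+1] + \sigma(s[i+1], -) \\
V[i+1, j] + \sigma(-, t[j+1])
\end{cases}
\qquad
V[i, 0] = V[0, j] = 0
```

The best local alignment score is then the largest entry anywhere in V, not necessarily the entry in the last row and column.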
Local Alignment Example
s = TAATA
t = TACTAA

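The example can be worked through with a short sketch of the local variant, assuming the +1/-1/-2 scores used earlier. Each cell may fall back to 0 (start a new match), the best cell anywhere in the matrix is the answer, and the traceback stops at the first 0. The function name `local_alignment` is illustrative, and ties are broken by taking the first best cell in row order.

```python
def local_alignment(s: str, t: str,
                    match: int = 1, mismatch: int = -1, indel: int = -2):
    """Return (best local score, aligned piece of s, aligned piece of t)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                       # start a new match
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + indel:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("TAATA", "TACTAA"))  # (3, 'TAA', 'TAA')
```

With these scores the best local score is 3. Another local alignment, TAATA against TACTA, also scores 3 (four matches and one mismatch); this sketch reports only the first one it finds.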
Sequence Alignment
We have seen two variants of sequence alignment:
- Global alignment
- Local alignment
Other variants:
- Finding the best overlap (exercise)
All are based on the same basic idea of dynamic programming.