
1 Sequence Alignment

2 Sequences
Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences
We can think of these sequences as strings of letters:
• DNA & RNA: alphabet of 4 letters
• Protein: alphabet of 20 letters

3 20 Amino Acids
• Glycine (G, GLY)
• Alanine (A, ALA)
• Valine (V, VAL)
• Leucine (L, LEU)
• Isoleucine (I, ILE)
• Phenylalanine (F, PHE)
• Proline (P, PRO)
• Serine (S, SER)
• Threonine (T, THR)
• Cysteine (C, CYS)
• Methionine (M, MET)
• Tryptophan (W, TRP)
• Tyrosine (Y, TYR)
• Asparagine (N, ASN)
• Glutamine (Q, GLN)
• Aspartic acid (D, ASP)
• Glutamic acid (E, GLU)
• Lysine (K, LYS)
• Arginine (R, ARG)
• Histidine (H, HIS)
• START codon: AUG
• STOP codons: UAA, UAG, UGA

4 Sequence Comparison
• Finding similarity between sequences is important for many biological questions
For example:
• Find genes/proteins with a common origin
  ◦ Allows us to predict function & structure
• Locate common subsequences in genes/proteins
  ◦ Identify common “motifs”
• Locate sequences that might overlap
  ◦ Helps in sequence assembly

5 Sequence Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
• GCGCATGGATTGAGCGA
• TGCGCCATTGATGACCA
A possible alignment:
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

6 Alignments
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
Three elements:
• Perfect matches
• Mismatches
• Insertions & deletions (indels)
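
The three kinds of column can be counted mechanically. The following is a minimal sketch, assuming the gap character is "-" and that both rows of the alignment have the same length; `classify_columns` is a hypothetical helper name, not something from the slides.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol, as used in the slides' alignments


def classify_columns(top: str, bottom: str) -> Counter:
    """Count matches, mismatches and indels in a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    counts = Counter(match=0, mismatch=0, indel=0)
    for a, b in zip(top, bottom):
        if a == GAP and b == GAP:
            raise ValueError("a column cannot contain two gaps")
        if a == GAP or b == GAP:
            counts["indel"] += 1      # a letter facing a gap
        elif a == b:
            counts["match"] += 1      # identical letters
        else:
            counts["mismatch"] += 1   # two different letters
    return counts


counts = classify_columns("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
print(dict(counts))  # {'match': 13, 'mismatch': 2, 'indel': 4}
```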

7 Choosing Alignments
There are many possible alignments
For example, compare:
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
to
    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--
Which one is better?

8 Scoring Alignments
Rough intuition:
• Similar sequences evolved from a common ancestor
• Evolution changed the sequences from this ancestral sequence by mutations:
  ◦ Replacement: one letter replaced by another
  ◦ Deletion: deletion of a letter
  ◦ Insertion: insertion of a letter
• Scoring of sequence similarity should examine how many operations took place

9 Simple Scoring Rule
Score each position independently:
• Match: +1
• Mismatch: -1
• Indel: -2
The score of an alignment is the sum of the positional scores
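
As a rough illustration of this rule, the sketch below sums the positional scores of a gapped alignment. The function name `score_alignment` and its keyword parameters are assumptions made for the example; the default values are the +1, -1, -2 scores from the slide.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Sum of independent per-column scores for a gapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += indel       # letter aligned with a gap
        elif a == b:
            total += match       # identical letters
        else:
            total += mismatch    # different letters
    return total


print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```

Passing different values for `match`, `mismatch` and `indel` shows how strongly the ranking of candidate alignments depends on the chosen scores.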

10 Example
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25

11 More General Scores
• The choice of +1, -1, and -2 scores was quite arbitrary
• Depending on the context, some changes are more plausible than others
  ◦ Exchange of an amino acid by one with similar properties (size, charge, etc.)
    vs.
  ◦ Exchange of an amino acid by one with opposite properties

12 For proteins

13 Additive Scoring Rules
• We define a scoring function by specifying a function σ
  ◦ σ(x,y) is the score of replacing x by y
  ◦ σ(x,-) is the score of deleting x
  ◦ σ(-,x) is the score of inserting x
• The score of an alignment is the sum of position scores
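
One way to picture an additive scoring rule is to pass the scoring function itself as a parameter. The sketch below is only illustrative: the nucleotide scores (a milder penalty for purine/purine or pyrimidine/pyrimidine exchanges) are made-up values chosen for the example, and the names `sigma` and `additive_score` are hypothetical.

```python
GAP = "-"
PURINES = {"A", "G"}  # C and T are treated as pyrimidines


def sigma(x: str, y: str) -> int:
    """Toy scoring function; the numeric values are illustrative assumptions."""
    if x == GAP or y == GAP:
        return -2                      # sigma(x, -) and sigma(-, x)
    if x == y:
        return 1                       # identical letters
    same_class = (x in PURINES) == (y in PURINES)
    return -1 if same_class else -2    # milder penalty within a chemical class


def additive_score(top: str, bottom: str, score=sigma) -> int:
    """Score of an alignment as the sum of its position scores."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    return sum(score(a, b) for a, b in zip(top, bottom))


print(additive_score("AAAC", "AG-C"))  # 1 + (-1) + (-2) + 1 = -1
```

For proteins the same structure applies, with `sigma` looking up a substitution matrix instead of computing a rule.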

14 Edit Distance
• The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other
• Computing the edit distance between two sequences is almost equivalent to finding the alignment that minimizes the distance
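
To make the definition concrete, here is a small sketch that assumes every replacement, deletion and insertion costs 1 (the slide does not fix the costs, so this is an assumption). It follows the definition directly: the cheapest way to transform a prefix of s into a prefix of t ends in one of the three operations. The name `edit_distance` is chosen for the example.

```python
from functools import lru_cache


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance between s and t (assumed costs: 1 per operation)."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j): cheapest cost of transforming s[:i] into t[:j]
        if i == 0:
            return j                 # insert the j letters of t[:j]
        if j == 0:
            return i                 # delete the i letters of s[:i]
        return min(
            d(i - 1, j - 1) + (s[i - 1] != t[j - 1]),  # replace (free if equal)
            d(i - 1, j) + 1,                           # delete s[i-1]
            d(i, j - 1) + 1,                           # insert t[j-1]
        )

    return d(len(s), len(t))


print(edit_distance("AAAC", "AGC"))  # 2
```

The recursion is memoised, so each pair (i, j) is evaluated once; it is meant for short strings, since deep recursion would be a problem for long ones.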

15 Computing Edit Distance
• How can we compute the edit distance?
  ◦ If |s| = n and |t| = m, there are more than $\binom{m+n}{m}$ alignments
• The additive form of the score allows us to use dynamic programming to compute the edit distance efficiently
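
To get a feeling for how quickly the number of alignments grows, the sketch below counts them with a simple recursion: the last column of an alignment is either a pair of letters, a letter of s over a gap, or a gap over a letter of t. It also prints the binomial coefficient C(n+m, m), which is a lower bound on that count. The name `count_alignments` is hypothetical, and alignments that differ only in the order of adjacent gaps are counted as distinct.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of gapped alignments of two sequences of lengths n and m."""
    if n == 0 or m == 0:
        return 1  # every remaining letter must face a gap
    return (
        count_alignments(n - 1, m - 1)   # last column pairs two letters
        + count_alignments(n - 1, m)     # last column: letter of s over a gap
        + count_alignments(n, m - 1)     # last column: gap over a letter of t
    )


for n in (2, 5, 10, 17):
    print(n, comb(2 * n, n), count_alignments(n, n))
```

Even for the 17-letter sequences used in the earlier example, enumerating every alignment is clearly impractical, which is what motivates dynamic programming.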

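16 Recursive Argument
Define the notation:
  $V[i,j] = d(s[1..i],\, t[1..j])$
• Using the recursive argument, we get the following recurrence for V:
  $$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases}$$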

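17 Recursive Argument
• Of course, we also need to handle the base cases in the recursion:
  $$V[0,0] = 0, \qquad V[i+1,0] = V[i,0] + \sigma(s[i+1], -), \qquad V[0,j+1] = V[0,j] + \sigma(-, t[j+1])$$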

18 Dynamic Programming Algorithm We fill the matrix using the recurrence rule

19 Dynamic Programming Algorithm
Conclusion: d(AAAC, AGC) = -1
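
The matrix fill can be sketched in a few lines. This is a minimal illustration, assuming the simple +1 / -1 / -2 scores from the earlier slide and 0-based Python indexing (so `V[i][j]` scores the first i letters of s against the first j letters of t); the names `sigma` and `global_alignment_matrix` are chosen for the example.

```python
def sigma(x: str, y: str) -> int:
    """Simple scores assumed here: match +1, mismatch -1, indel -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def global_alignment_matrix(s: str, t: str, score=sigma):
    """Fill V so that V[i][j] is the best score of aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):                       # base case: s[:i+1] against nothing
        V[i + 1][0] = V[i][0] + score(s[i], "-")
    for j in range(m):                       # base case: nothing against t[:j+1]
        V[0][j + 1] = V[0][j] + score("-", t[j])
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + score(s[i], t[j]),      # align s[i] with t[j]
                V[i][j + 1] + score(s[i], "-"),   # s[i] faces a gap
                V[i + 1][j] + score("-", t[j]),   # t[j] faces a gap
            )
    return V


V = global_alignment_matrix("AAAC", "AGC")
for row in V:
    print(row)
print(V[-1][-1])  # -1
```

The fill takes time and memory proportional to n·m, in contrast to the enormous number of alignments it implicitly compares.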

20 Reconstructing the Best Alignment
• To reconstruct the best alignment, we record which case in the recursive rule maximized the score

21 Reconstructing the Best Alignment
• We now trace back the path that corresponds to the best alignment
    AAAC
    AG-C

22 Reconstructing the Best Alignment
• Sometimes, more than one alignment has the best score
    AAAC
    A-GC
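
The traceback can be sketched as follows. Rather than storing pointers, this version re-checks at each cell which cases of the recurrence achieve the stored maximum and follows all of them, so it lists every co-optimal alignment. It assumes the simple +1 / -1 / -2 scores; `optimal_alignments` is a hypothetical name, and the enumeration is only sensible for short sequences because the number of co-optimal alignments can be large.

```python
def sigma(x: str, y: str) -> int:
    """Simple scores assumed here: match +1, mismatch -1, indel -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def optimal_alignments(s: str, t: str, score=sigma):
    """Return the best global score and all alignments that achieve it."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                V[i - 1][j] + score(s[i - 1], "-"),
                V[i][j - 1] + score("-", t[j - 1]),
            )

    results = []

    def walk(i: int, j: int, top: str, bottom: str) -> None:
        # Walk back from (i, j), prepending one alignment column per step.
        if i == 0 and j == 0:
            results.append((top, bottom))
            return
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            walk(i - 1, j - 1, s[i - 1] + top, t[j - 1] + bottom)
        if i > 0 and V[i][j] == V[i - 1][j] + score(s[i - 1], "-"):
            walk(i - 1, j, s[i - 1] + top, "-" + bottom)
        if j > 0 and V[i][j] == V[i][j - 1] + score("-", t[j - 1]):
            walk(i, j - 1, "-" + top, t[j - 1] + bottom)

    walk(n, m, "", "")
    return V[n][m], results


best, alignments = optimal_alignments("AAAC", "AGC")
print(best)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

Under these scores the output for AAAC and AGC includes both alignments shown on the slides, each with score -1.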

23 Local Alignment
Consider now a different question:
  ◦ Can we find similar substrings of s and t?
  ◦ Formally, given s[1..n] and t[1..m], find i, j, k, and l such that d(s[i..j], t[k..l]) is maximal

24 Local Alignment
• As before, we use dynamic programming
  ◦ We now want to set V[i,j] to record the best alignment of a suffix of s[1..i] and a suffix of t[1..j]
• How should we change the recurrence rule?

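25 Local Alignment
New option:
• We can start a new match instead of extending a previous alignment
  $$V[i+1,j+1] = \max \begin{cases} 0 & \text{(alignment of empty suffixes)} \\ V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases}$$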

26 Local Alignment Example s = TAATA t = TACTAA

27 Local Alignment Example s = TAATA t = TACTAA

28 Local Alignment Example s = TAATA t = TACTAA

29 Local Alignment Example s = TAATA t = TACTAA
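
The local variant can be sketched by adding the "start fresh" option (a score of 0) to the recurrence and tracing back from the highest-scoring cell until a 0 is reached. The example below assumes the simple +1 / -1 / -2 scores and uses the strings s = TAATA and t = TACTAA; `local_alignment` is a hypothetical name, and when several cells tie for the best score this sketch reports only the first one it encounters.

```python
def local_alignment(s: str, t: str,
                    match: int = 1, mismatch: int = -1, indel: int = -2):
    """Return (best score, aligned part of s, aligned part of t)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                0,                      # start a new alignment here
                V[i - 1][j - 1] + sub,  # extend with a (mis)match
                V[i - 1][j] + indel,    # s[i-1] faces a gap
                V[i][j - 1] + indel,    # t[j-1] faces a gap
            )
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + indel:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = local_alignment("TAATA", "TACTAA")
print(result)  # (3, 'TAA', 'TAA')
```

With these scores the best local score for the example is 3, and it is reached by more than one pair of substrings (for instance TAATA against TACTA also scores 3).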

30 Sequence Alignment
We have seen two variants of sequence alignment:
• Global alignment
• Local alignment
Other variants:
• Finding best overlap (exercise)
All are based on the same basic idea of dynamic programming

