Sequence Alignment Tutorial #2 © Ilan Gronau. Based on original slides of Ydo Wexler & Dan Geiger.
Sequence Comparison
Much of bioinformatics involves sequences: DNA sequences, RNA sequences and protein sequences.
We can think of these sequences as strings of letters.
DNA & RNA: |alphabet|=4
Protein: |alphabet|=20
Global Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Global Alignment
Example (cont):
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements: perfect matches, mismatches, insertions & deletions (indels).
Best biological explanation. Biological data. Hypotheses space. Symmetric view of evolution.
Global Alignment scoring scheme
Score each position independently: Match: +1, Mismatch: -1, Indel: -2
The score of an alignment is the sum of its position scores.
Example:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25
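The position-by-position scoring can be checked mechanically. Below is a minimal Python sketch; the function name `score_alignment` and its parameter names are chosen here for illustration and are not part of the original slides. It takes the two rows of an alignment (with `-` for gaps) and sums the column scores under the scheme above.

```python
def score_alignment(row_s, row_t, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of two gapped rows of equal length."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two gaps")
        if a == "-" or b == "-":
            total += indel      # insertion or deletion
        elif a == b:
            total += match      # perfect match
        else:
            total += mismatch   # substitution
    return total


# First example alignment: 13 matches, 2 mismatches, 4 indels
print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```

Because each column is scored independently, the total does not depend on the order in which columns are visited.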
Sequence Alignment Variants
Two basic variants of sequence alignment:
Global alignment (the Needleman-Wunsch algorithm)
Local alignment (the Smith-Waterman algorithm)
Today we'll see:
Overlap alignment
Affine cost for gaps
We'll use the ideas of dynamic programming presented in the lecture.
Overlap Alignment
Consider the following problem: find the most significant overlap between two sequences S and T.
Possible overlap relations:
a. the end of one sequence overlaps the start of the other
b. one sequence lies within the other
Difference from local alignment: here the alignment must extend to the endpoints of the two sequences.
Overlap Alignment
Formally: given S[1..n] and T[1..m], find i, j such that
d = max{ D(S[1..i],T[j..m]), D(S[i..n],T[1..j]), D(S[1..n],T[i..j]), D(S[i..j],T[1..m]) }
is maximal (D denotes the global alignment score of its two arguments).
Solution: same as global alignment, except that we do not penalise overhanging ends.
Overlap Alignment
Initialization: V[i,0]=0, V[0,j]=0
Recurrence: as in global alignment
Score: the maximum value in the bottom row and the rightmost column
(Figure: global, local and overlap alignment compared.)
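The three ingredients above (zero initialization, global recurrence, maximum over the last row and last column) fit in a few lines. The following is a minimal Python sketch, not a reference implementation; the name `overlap_score` and the default parameters (taken from the example scoring scheme Match: +4, Mismatch: -1, Indel: -5) are choices made here for illustration. It returns only the score, without traceback.

```python
def overlap_score(s, t, match=4, mismatch=-1, indel=-5):
    """Best overlap-alignment score of s and t (score only, no traceback)."""
    n, m = len(s), len(t)
    # V[i][0] = V[0][j] = 0: leading overhangs are not penalised
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # S[i] aligned to T[j]
                V[i - 1][j] + indel,    # S[i] aligned to a gap
                V[i][j - 1] + indel,    # T[j] aligned to a gap
            )
    # Trailing overhangs are free: take the best cell in the
    # bottom row or the rightmost column
    return max(max(V[n]), max(row[m] for row in V))


print(overlap_score("PAWHEAE", "HEAGAWGHEE"))  # 11
```

The only differences from a global alignment routine are the zero borders and the final maximum; the inner recurrence is unchanged.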
Overlap Alignment (Example)
S = PAWHEAE
T = HEAGAWGHEE
Scoring scheme: Match: +4, Mismatch: -1, Indel: -5
Overlap Alignment (Example)
The best overlap is:
PAWHEAE------
---HEAGAWGHEE
Pay attention! A different scoring scheme could yield a different result. For example, with Match: +4, Mismatch: -1 and Indel: -2 (instead of -5):
---PAW-HEAE
HEAGAWGHEE-
Affine gap scores
Observation: insertions and deletions often occur in blocks longer than a single nucleotide.
Consequence: the current scoring scheme charges a constant penalty per gap unit, so a block of g consecutive gap positions costs as much as g separate indels. This models the above phenomenon poorly.
Question: how do we modify the scheme to account for it?
Alignment with affine gap scores
Penalty score for a gap of length g: d + (g-1)e, where
d - penalty for introduction of a gap
e - penalty for elongating the gap by one unit
Typically d > e.
Problem: when aligning S[i] to a gap we do not know how much to penalize: d or e?
Solution: we compute 3 matrices simultaneously, each holding the best score of aligning S[1..i] with T[1..j] under a condition on the last column:
M(i,j) - the score obtained by aligning S[i] to T[j]
IS(i,j) - the score obtained by aligning S[i] to a gap
IT(i,j) - the score obtained by aligning T[j] to a gap
Affine gap scores
Initialization: depending on the problem (global, local, ...)
Recurrence: uses already known values M(i',j'), IS(i',j'), IT(i',j') at the three neighboring cells (i-1,j-1), (i-1,j) and (i,j-1).
We assume that a deletion will not be followed directly by an insertion. This can be obtained by using
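The recurrence itself is not spelled out in the text above. The Python sketch below uses one standard textbook form of the three-matrix recurrence for the global case, under the assumptions stated in the slides: a gap of length g costs d + (g-1)e, and a gap in one sequence is never followed directly by a gap in the other (so IS never reads from IT and vice versa). The name `affine_global_score`, the default scores and the global initialization are choices made here for illustration.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match=1, mismatch=-1, d=3, e=1):
    """Global alignment score with affine gaps: gap of length g costs d + (g-1)*e."""
    n, m = len(s), len(t)
    # M: S[i] aligned to T[j]; IS: S[i] aligned to a gap; IT: T[j] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    # Global initialization (an assumption): leading gaps are fully penalised
    for i in range(1, n + 1):
        IS[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        IT[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) + sub
            # Open a new gap (pay d) or extend the current one (pay e)
            IS[i][j] = max(M[i - 1][j] - d, IS[i - 1][j] - e)
            IT[i][j] = max(M[i][j - 1] - d, IT[i][j - 1] - e)
    return max(M[n][m], IS[n][m], IT[n][m])


# One gap of length 2 between two matches: 1 + 1 - (3 + 1)
print(affine_global_score("ACGT", "AT"))  # -2
```

Keeping a separate matrix per "last column type" is what lets each cell know whether the next gap position opens a gap (cost d) or extends one (cost e).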
Affine gap scores
Simplification: why are two matrices enough?