Lecture 5: Local Sequence Alignment Algorithms

CS 5263 Bioinformatics Lecture 5: Local Sequence Alignment Algorithms

Poll
Who has learned, and still remembers, finite state machines/automata, regular grammars, and context-free grammars?

Roadmap
Review of last lecture
Local sequence alignment
Statistics of sequence alignment
    Substitution matrix
    Significance of alignment

Bounded Dynamic Programming
[Figure: DP matrix for x1 … xM against y1 … yN, with the computation restricted to a band of width k]
Time: O(kM)
Memory: O(kM), possibly O(M+k)

Linear-space alignment
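Memory: O(M+N)
Time: 2MN
[Figure: DP matrix split at M/2; the optimal path crosses the split at k*, leaving two sub-problems of sizes M/2 by k* and M/2 by (N - k*)]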

Graph representation of seq alignment
[Figure: alignment graph with weighted edges, from node (0,0) to node (3,4)]
An optimal alignment is a longest path from (0, 0) to (m, n) on the alignment graph.

Question
If I change the scoring scheme, will it change the alignment?
    Match = 1, mismatch = gap = -2
    changed to
    Match = 2, mismatch = gap = -1?
Answer: Yes, it can.

Proof
Let F1 be the score of an optimal alignment under the scoring scheme
    Match = m > 0
    Mismatch = s < 0
    Gap = d < 0
Let a1, b1, c1 be the numbers of matches, mismatches, and gaps in that alignment:
    F1 = a1m + b1s + c1d

Proof (cont’)
Let F2 be the score of a sub-optimal alignment under the same scoring scheme.
Let a2, b2, c2 be the numbers of matches, mismatches, and gaps in that alignment:
    F2 = a2m + b2s + c2d
Let F1 = F2 + k, where k > 0.

Proof (cont’)
Now we change the scoring scheme, so that
    Match = m + 1
    Mismatch = s + 1
    Gap = d + 1

Proof (cont’)
The new scores for the two alignments become:
    F1’ = a1(m+1) + b1(s+1) + c1(d+1) = a1m + b1s + c1d + (a1+b1+c1) = F1 + L1
    F2’ = a2(m+1) + b2(s+1) + c2(d+1) = F2 + (a2+b2+c2) = F2 + L2
where L1 = a1+b1+c1 is the length of alignment 1 and L2 = a2+b2+c2 is the length of alignment 2.

Proof (cont’)
    F1’ - F2’ = F1 - F2 + (a1+b1+c1) - (a2+b2+c2)
              = k + L1 - L2
In order for F1’ < F2’, we need k + L1 - L2 < 0, i.e. L2 - L1 > k.

Proof (cont’)
This means that if, under the original scoring scheme, F1 is greater than F2 by k, but alignment 2 is longer than alignment 1 by at least k+1 columns, then F2’ will be greater than F1’ under the new scoring scheme.
We only need to show one example in which two such alignments exist.

Example
Alignment 1 (2 matches, 3 mismatches):
    AACAG
    | |
    ATCGT
    F1 = 2m + 3s
Alignment 2 (3 matches, 4 gaps):
    AA-CAG-
     | | |
    -ATC-GT
    F2 = 3m + 4d

With m = 1, s = d = -2:
    F1 = 2x1 - 3x2 = -4
    F2 = 3x1 - 4x2 = -5
    F1 > F2

With m = 2, s = d = -1:
    F1’ = 2x2 - 3x1 = 1
    F2’ = 3x2 - 4x1 = 2
    F2’ > F1’
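The counterexample can be checked mechanically. The following is a minimal sketch (the function name `score_alignment` is introduced here for illustration only) that scores the two alignments column by column under both scoring schemes.

```python
def score_alignment(top, bottom, match, mismatch, gap):
    """Score a gapped alignment column by column ('-' marks a gap)."""
    assert len(top) == len(bottom)
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


alignment_1 = ("AACAG", "ATCGT")      # 2 matches, 3 mismatches
alignment_2 = ("AA-CAG-", "-ATC-GT")  # 3 matches, 4 gaps

for m, s, d in [(1, -2, -2), (2, -1, -1)]:
    f1 = score_alignment(*alignment_1, m, s, d)
    f2 = score_alignment(*alignment_2, m, s, d)
    print(f"m={m}, s={s}, d={d}: F1={f1}, F2={f2}")
```

Adding 1 to every score rewards the longer alignment by one extra point per column, which is why the ranking flips.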

On the other hand, if we had doubled our scores, such that
    m’ = 2m, s’ = 2s, d’ = 2d
then F1’ = 2F1 and F2’ = 2F2, so our alignment won’t be changed.

Today
How to model gaps more accurately?
Local sequence alignment
Statistics of alignment

What’s a better alignment?
    GACGCCGAACG
    |||||   |||
    GACGC---ACG
    Score = 8 x m - 3 x d

    GACGCCGAACG
    |||| | | ||
    GACG-C-A-CG
    Score = 8 x m - 3 x d

Both alignments receive the same score. However, gaps usually occur in bunches:
    During evolution, chunks of DNA may be lost entirely
    Aligning genomic sequence vs. cDNA (reverse complementary to mRNA)

Model gaps more accurately
Current model: a gap of length n incurs penalty n x d
General: convex function, e.g. $\gamma(n) = c\sqrt{n}$
[Figure: plots of $\gamma(n)$ against n for the two models]

General gap dynamic programming
Initialization: same
Iteration:
$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i,y_j) \\ \max_{k=0 \ldots i-1} F(k,j) - \gamma(i-k) \\ \max_{k=0 \ldots j-1} F(i,k) - \gamma(j-k) \end{cases}$
Termination: same
Running time: O(N^2 M) (cubic)
Space: O(NM) (linear-space algorithm not applicable)
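The general-gap recurrence can be sketched directly in code. This is a minimal illustration, not a reference implementation: the names `general_gap_score`, `sigma`, and `gamma` are introduced here, the boundary values F(i,0) = -gamma(i) and F(0,j) = -gamma(j) are an assumption about what "same" initialization means for global alignment, and the match/mismatch scores and the constant c = 3 are arbitrary.

```python
import math


def general_gap_score(x, y, sigma, gamma):
    """Global alignment score where a gap of length n costs gamma(n)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)  # x1..xi aligned to one leading gap (assumed boundary)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            best = F[i - 1][j - 1] + sigma(x[i - 1], y[j - 1])
            for k in range(i):  # gap covering x(k+1)..xi
                best = max(best, F[k][j] - gamma(i - k))
            for k in range(j):  # gap covering y(k+1)..yj
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]


def sigma(a, b):
    return 2 if a == b else -1  # assumed match / mismatch scores


# Square-root gap penalty with an arbitrary constant c = 3
print(general_gap_score("GACGCCGAACG", "GACGCACG", sigma, lambda n: 3 * math.sqrt(n)))
```

The two inner loops over k are what make the running time cubic: every cell looks back along its whole row and column.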

Compromise: affine gaps
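$\gamma(n) = d + (n - 1)e$
    d: gap open
    e: gap extension
[Figure: plot of $\gamma(n)$ against n, starting at d and increasing by e per extra gap position]
Match: 2, Gap open: 5, Gap extension: 1

    GACGCCGAACG
    |||||   |||
    GACGC---ACG
    8x2 - 5 - 2 = 9

    GACGCCGAACG
    |||| | | ||
    GACG-C-A-CG
    8x2 - 3x5 = 1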
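The two affine-gap scores can be reproduced with a short sketch. The function name `affine_score` is introduced here for illustration, and the mismatch score of -1 is an assumption (neither alignment contains a mismatch, so it does not affect the result).

```python
def affine_score(top, bottom, match, mismatch, gap_open, gap_extend):
    """Score an alignment where a gap of length n costs gap_open + (n - 1) * gap_extend."""
    total = 0
    prev_gap_row = None  # row that held the gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is cheaper than opening a new one
            total -= gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            total += match if a == b else mismatch
            prev_gap_row = None
    return total


bunched = ("GACGCCGAACG", "GACGC---ACG")
scattered = ("GACGCCGAACG", "GACG-C-A-CG")
for top, bottom in (bunched, scattered):
    print(top, bottom, affine_score(top, bottom, 2, -1, 5, 1))
```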

Additional states
The amount of state needed increases
In scoring a single entry in our matrix, we need to remember an extra piece of information:
    Are we continuing a gap in x? (if not, starting one is more expensive)
    Are we continuing a gap in y? (if not, starting one is more expensive)
    Are we continuing from a match between xi and yj?

Finite State Automaton
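[Figure: three-state automaton with the states "xi and yj aligned", "xi aligned to a gap", and "yj aligned to a gap"; moving into a gap state costs d, staying in a gap state costs e]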

Dynamic programming
We encode this information in three different matrices
For each element (i,j) we use three variables
    F(i,j): best alignment of x1..xi & y1..yj if xi aligns to yj
    Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap
    Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap

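$F(i,j) = \sigma(x_i,y_j) + \max \begin{cases} F(i-1,j-1) & \text{continuing alignment} \\ I_x(i-1,j-1) & \text{closing gaps in } x \\ I_y(i-1,j-1) & \text{closing gaps in } y \end{cases}$
$I_x(i,j) = \max \begin{cases} F(i,j-1) - d & \text{opening a gap in } x \\ I_y(i,j-1) - d & \text{opening a gap in } x \text{ right after a gap in } y \\ I_x(i,j-1) - e & \text{gap extension in } x \end{cases}$
$I_y(i,j) = \max \begin{cases} F(i-1,j) - d & \text{opening a gap in } y \\ I_x(i-1,j) - d & \text{opening a gap in } y \text{ right after a gap in } x \\ I_y(i-1,j) - e & \text{gap extension in } y \end{cases}$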

[Figure: the three matrices F, Ix, and Iy]

If we stack all three matrices
No cyclic dependency
We can fill in all three matrices in order

Algorithm
for i = 1:M
    for j = 1:N
        Fill in F(i, j), Ix(i, j), Iy(i, j)
    end
end
F(M, N) = max (F(M, N), Ix(M, N), Iy(M, N))
Time: O(MN)
Space: O(MN), or O(N) when combined with the linear-space algorithm
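The three-matrix fill can be sketched as follows. This is a minimal illustration under stated assumptions: the name `affine_global_score` is introduced here, the boundary values (leading gaps charged d + (n - 1)e, all other boundary cells set to minus infinity) are one common choice that the slides do not spell out, and only the score is returned, not the alignment.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, sigma, d, e):
    """Best global alignment score with gap-open penalty d and gap-extension penalty e."""
    M, N = len(x), len(y)
    F = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Ix = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # yj aligned to a gap
    Iy = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # xi aligned to a gap
    F[0][0] = 0
    for j in range(1, N + 1):
        Ix[0][j] = -d - (j - 1) * e  # assumed boundary: y1..yj against one gap
    for i in range(1, M + 1):
        Iy[i][0] = -d - (i - 1) * e  # assumed boundary: x1..xi against one gap
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = sigma(x[i - 1], y[j - 1]) + max(
                F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] - d, Iy[i][j - 1] - d, Ix[i][j - 1] - e)
            Iy[i][j] = max(F[i - 1][j] - d, Ix[i - 1][j] - d, Iy[i - 1][j] - e)
    return max(F[M][N], Ix[M][N], Iy[M][N])


def sigma(a, b):
    return 2 if a == b else -1  # assumed match / mismatch scores


print(affine_global_score("GACGCCGAACG", "GACGCACG", sigma, d=5, e=1))
```

Each cell reads only cells with a smaller i or a smaller j, so a plain row-by-row sweep respects every dependency.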

To simplify
$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i,y_j) \\ I(i-1,j-1) + \sigma(x_i,y_j) \end{cases}$
$I(i,j) = \max \begin{cases} F(i,j-1) - d \\ I(i,j-1) - e \\ F(i-1,j) - d \\ I(i-1,j) - e \end{cases}$
I(i, j): best alignment between x1…xi and y1…yj if either xi or yj is aligned to a gap
This is possible because no alternating gaps are allowed

To summarize
Global alignment
    Basic algorithm: Needleman-Wunsch
    Variants:
        Overlapping detection
        Longest common subsequences
        (achieved by varying initial conditions or scoring)
    Bounded DP (pruning search space)
    Linear space (divide-and-conquer)
    Affine gap penalty

Local alignment

The local alignment problem
Given two strings X = x1……xM, Y = y1……yN
Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum
e.g. X = abcxdex, Y = xxxcde
    X’ = cxde
    Y’ = c-de

Why local alignment
Conserved regions may be a small part of the whole
    “Active site” of a protein
    Scattered genes or exons among “junk”
    Don’t have whole sequence
Global alignment might miss them if flanking “junk” outweighs similar regions

Genes are shuffled between genomes
[Figure: blocks labelled A, B, C, D appearing in different orders in different genomes]

Naïve algorithm
for all substrings X’ of X and Y’ of Y
    Align X’ & Y’ via dynamic programming
    Retain pair with max value
end
Output the retained pair
Time: O(M^2) choices for X’, O(N^2) for Y’, O(MN) for DP, so O(M^3 N^3) total.

Reminder
The overlap detection algorithm
We do not give penalty to gaps at the ends
[Figure: two overlapping sequences with the free end gaps marked]

Similar here
The unaligned regions incur no penalty

The big idea
Whenever we get to some bad region (negative score), we ignore the previous alignment
Reset score to zero

The Smith-Waterman algorithm
Initialization: F(0, j) = F(i, 0) = 0
Iteration:
$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j) - d \\ F(i,j-1) - d \\ F(i-1,j-1) + \sigma(x_i,y_j) \end{cases}$

The Smith-Waterman algorithm
Termination:
If we want the best local alignment:
    FOPT = max over i, j of F(i, j)
If we want all local alignments scoring > t:
    For all i, j with F(i, j) > t, trace back
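A compact sketch of Smith-Waterman with a linear gap penalty follows. The name `smith_waterman` and its parameters are introduced here for illustration; the default scores (match 2, mismatch -1, gap penalty 1) and the strings abcxdex and xxxcde are taken from the lecture's examples. Ties in the trace back are broken in favour of the diagonal, so only one of possibly several optimal alignments is returned.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=1):
    """Return (best score, aligned x', aligned y') for the best local alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - gap:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, a, b = smith_waterman("abcxdex", "xxxcde")
print(score)
print(a)
print(b)
```

To report every local alignment scoring above a threshold t, the trace back would be started from each cell with F(i, j) > t instead of only from the maximum.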

The correctness of the algorithm can be proved by induction using the alignment graph
[Figure: alignment graph]

Example (Match: 2, Mismatch: -1, Gap: -1)
[Figure: the Smith-Waterman matrix filled in step by step; the scores along the highlighted path are 2, 1, 3, 5, 4]

Trace back
Starting from the highest-scoring cell (score 5), two optimal local alignments are recovered:
    cxde
    | ||
    c-de

    x-de
    | ||
    xcde

No negative values in local alignment DP array
Optimal local alignment will never have a gap on either end
Local alignment: “Smith-Waterman”
Global alignment: “Needleman-Wunsch”

Analysis
Time:
    O(MN) for finding the best alignment
    More when reporting sub-optimal alignments, depending on their number
Memory:
    O(MN)
    O(M+N) possible

The statistics of alignment
Where does $\sigma(x_i, y_j)$ come from?
Are two aligned sequences actually related?

Probabilistic model of alignments
We’ll focus on protein alignments without gaps
Given an alignment, we can consider two possibilities
    R: the sequences are related by evolution
    U: the sequences are unrelated
How can we distinguish these possibilities?
How is this view related to the amino-acid substitution matrix?

Model for unrelated sequences
Assume each position of the alignment is independently sampled from some distribution of amino acids
ps: probability of amino acid s in the sequences
Probability of seeing an amino acid s aligned to an amino acid t by chance is
    Pr(s, t | U) = ps * pt
Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn randomly is
$\Pr(x, y \mid U) = \prod_{i=1}^{n} p_{x_i} \prod_{i=1}^{n} p_{y_i}$

Model for related sequences
Assume each pair of aligned amino acids evolved from a common ancestor
Let qst be the probability that amino acid s in one sequence is related to t in another sequence
The probability of an alignment of x and y is given by
$\Pr(x, y \mid R) = \prod_{i=1}^{n} q_{x_i y_i}$

Probabilistic model of Alignments
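How can we decide which possibility (U or R) is more likely?
One principled way is to consider the relative likelihood of the two possibilities (the odds ratio)
$\frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \frac{\prod_i q_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$
A higher ratio means that R is more likely than U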

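Log odds ratio
Taking the log, we get
$\log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_i \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$
Recall that the score of an alignment is given by
$S = \sum_i \sigma(x_i, y_i)$

Therefore, if we define
$\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$
we are actually defining the alignment score as the log odds ratio (log-likelihood ratio) between the two models R and U
This is indeed how biologists have defined the substitution matrices for proteins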
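A small numeric check makes the identity concrete: summing per-column log-odds scores gives the log of the ratio of the two alignment probabilities. This is a sketch only; the two-letter alphabet and the values of p and q below are made-up toy numbers, not taken from any real substitution matrix, and the function names are introduced here for illustration.

```python
import math

# Toy (hypothetical) background and pair probabilities over a two-letter alphabet
p = {"L": 0.5, "K": 0.5}
q = {("L", "L"): 0.4, ("K", "K"): 0.4, ("L", "K"): 0.1, ("K", "L"): 0.1}


def pr_unrelated(x, y):
    """Pr(x, y | U): every residue drawn independently from p."""
    return math.prod(p[a] * p[b] for a, b in zip(x, y))


def pr_related(x, y):
    """Pr(x, y | R): every aligned pair drawn from q."""
    return math.prod(q[(a, b)] for a, b in zip(x, y))


def log_odds_score(x, y):
    """Alignment score as the sum of per-column log-odds scores."""
    return sum(math.log(q[(a, b)] / (p[a] * p[b])) for a, b in zip(x, y))


x, y = "LLKL", "LKKL"
print(log_odds_score(x, y))
print(math.log(pr_related(x, y) / pr_unrelated(x, y)))
```

Both printed values agree up to floating-point rounding, because the log of a product is the sum of the logs.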

ps can be counted from the available protein sequences
But how do we get qst? (the probability that s and t have a common ancestor)
    Counted from trusted alignments of related sequences

Protein Substitution Matrices
Two popular sets of matrices for protein sequences
PAM matrices [Dayhoff et al, 1978]
    Better for aligning closely related sequences
BLOSUM matrices [Henikoff & Henikoff, 1992]
    For both closely and remotely related sequences

Positive for chemically similar substitutions
Common amino acids get low weights
Rare amino acids get high weights

BLOSUM-N matrices
Constructed from a database called BLOCKS
    Contains many closely related sequences
    Conserved amino acids may be over-counted
N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity
    identity: % of matched columns
Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)

If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with a higher N, say BLOSUM75
On the other hand, if you want to detect remote homology, you may want to use a lower N, say BLOSUM50
BLOSUM62 is the standard

For DNAs
No database of trusted alignments to start with
Specify the percentage identity you would like to detect
You can then get the substitution matrix by some calculation

For example
Suppose pA = pC = pT = pG = 0.25
We want 88% identity
qAA = qCC = qTT = qGG = 0.88/4 = 0.22, the rest = 0.12/12 = 0.01
$\sigma(A, A) = \sigma(C, C) = \sigma(G, G) = \sigma(T, T) = \log(0.22 / (0.25 \times 0.25)) = 1.26$
$\sigma(s, t) = \log(0.01 / (0.25 \times 0.25)) = -1.83$ for s ≠ t
(log is the natural logarithm)

Substitution matrix
          A      C      G      T
    A   1.26  -1.83  -1.83  -1.83
    C  -1.83   1.26  -1.83  -1.83
    G  -1.83  -1.83   1.26  -1.83
    T  -1.83  -1.83  -1.83   1.26

         A   C   G   T
    A    5  -7  -7  -7
    C   -7   5  -7  -7
    G   -7  -7   5  -7
    T   -7  -7  -7   5
Scale won’t change the alignment
Multiply by 4 and then round off to get integers
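The DNA calculation can be reproduced in a few lines. This is a minimal sketch assuming uniform base frequencies of 0.25, natural logarithms, and mismatch probability spread evenly over the 12 off-diagonal pairs; the function name `dna_substitution_scores` is introduced here for illustration.

```python
import math


def dna_substitution_scores(identity, p=0.25):
    """Return (match score, mismatch score) as natural-log odds for a target identity."""
    q_match = identity / 4            # 4 identical pairs share the identity mass
    q_mismatch = (1 - identity) / 12  # 12 mismatching ordered pairs share the rest
    s_match = math.log(q_match / (p * p))
    s_mismatch = math.log(q_mismatch / (p * p))
    return s_match, s_mismatch


s_match, s_mismatch = dna_substitution_scores(0.88)
print(round(s_match, 2), round(s_mismatch, 2))     # raw log-odds scores
print(round(4 * s_match), round(4 * s_mismatch))   # scaled by 4 and rounded to integers
```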

Arbitrary substitution matrix
Say you have a substitution matrix provided by someone
It’s important to know what you are actually looking for when you use the matrix

What’s the difference? Which one should I use?

         A   C   G   T
    A    1  -2  -2  -2
    C   -2   1  -2  -2
    G   -2  -2   1  -2
    T   -2  -2  -2   1

         A   C   G   T
    A    5  -4  -4  -4
    C   -4   5  -4  -4
    G   -4  -4   5  -4
    T   -4  -4  -4   5

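We had
$\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$
Scale it, so that
$\sigma(s, t) = \frac{1}{\lambda} \log \frac{q_{st}}{p_s p_t}$
Reorganize:
$q_{st} = p_s p_t e^{\lambda \sigma(s, t)}$

Since all probabilities must sum to 1,
$\sum_{s,t} q_{st} = 1$
We have
$\sum_{s,t} p_s p_t e^{\lambda \sigma(s, t)} = 1$
Suppose again ps = 0.25 for any s
We know $\sigma(s, t)$ from the substitution matrix
We can solve the equation for λ
Plug λ into $q_{st} = p_s p_t e^{\lambda \sigma(s, t)}$ to get qst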

Matrix with 1 on the diagonal and -2 elsewhere:
    λ = 1.33
    qst = 0.24 for s = t, and about 0.004 for s ≠ t
    Translate: 95% identity
Matrix with 5 on the diagonal and -4 elsewhere:
    λ = 0.19 (that is, e^λ = 1.21)
    qst = 0.16 for s = t, and 0.03 for s ≠ t
    Translate: 65% identity
