Computational Biology Lecture #6: Matching and Alignment Bud Mishra Professor of Computer Science and Mathematics 10 ¦ 22 ¦ 2001 12/1/2018 ©Bud Mishra, 2001
Inexact Matching ©Bud Mishra, 2001 Example: Edit Distance Problem: Edit distance between two biological sequences May correspond to: Evolutionary Distance Functional Distance Structural Distance 12/1/2018 ©Bud Mishra, 2001
Edit Distance ©Bud Mishra, 2001 Simplest distance function corresponds to: EDIT DISTANCE Atomic Edit Functions: Insertion AATCGG a AATACGG Deletion AATACGG a AATCGG Substitution AATCGG a AATAGG A composite edit function ' Function Composition of Atomic Edit Functions 12/1/2018 ©Bud Mishra, 2001
Cost of a Composite Edit Function (Based on the cost or distance for Atomic Edit Functions) Given: Two strings S1 and S2 Distance(S1, S2) = min { cost(E) | E(S1) = S2} where E = composite edit function mapping S1 to S2. 12/1/2018 ©Bud Mishra, 2001
Some Properties of Distance Function Assume: (8 e= Atomic Edit Function) cost(e) = cost(e-1) Distance(S1, S2) = Distance(S2, S1) Symmetric Distance(S1, S1) = 0 Distance(S1, S2) + Distance(S2, S3) 5 Distance(S1, S3) Triangle Inequality Simplest Cost Function: Each atomic edit function is of unit cost. 12/1/2018 ©Bud Mishra, 2001
Edit Operations ©Bud Mishra, 2001 I: Insertion of a character into the first string S1 D: Deletion of a character from the first string S1 R: Replacement (or Substitution) of a character in the first string S1 with a character in the second string S2 M: Matching (Identity) 12/1/2018 ©Bud Mishra, 2001
Edit Transcript ©Bud Mishra, 2001 Example: GATT A CA | | | | | The complete edit function is described by an “edit transcript” EDIT TRANSCRIPT = s 2 {D, M, R, I}* Example (in left): Edit transcript = DDMIRIMMII Edit Distance = 1+1+0+1+1+1+0+0+1+1=7 GATT A CA | | | | | TTCGGCATT D 1 2 MM I 3 R 4 5 6 7 12/1/2018 ©Bud Mishra, 2001
Levenshtein Distance or Edit Distance Edit Distance between two strings S1 and S2 is defined as the minimum number of atomic edit operations – insertions, deletions (indels), and substitutions – needed to transform the first string into the second Optimal Transcript = An edit transcript corresponding to the minimum number of atomic edit operations of unit cost. EDP The Edit Distance Problem is to compute the edit distance between two given strings, along with an optimal edit transcript that describes the transformation (alignment) 12/1/2018 ©Bud Mishra, 2001
Dynamic Programming Calculation of Edit Distance Define: D(i,j) ´ Min number of atomic edit operations needed to transform the first i characters of S1 into the first j characters of S2 ´ EditDistance(S1[1..i], S2[1..j]) |S1| = n |S2| = m Distance(S1, S2) = D(n,m) Dynamic Programming: 3 components: Recurrence Relation Tabular Computation Traceback 12/1/2018 ©Bud Mishra, 2001
Recurrence ©Bud Mishra, 2001 D(i-1,j-1) D(i-1,j) D(i,j-1) D(i,j) Base Relation: D(0,0) = 0 EditDistance(l,l) = 0 Recurrence Relations: In 1 coordinate: D(i,0) = D(i-1,0)+1 (S1[i] deleted) D(0,j) = D(0,j-1)+1 (S2[j] inserted) EditDistance(S1[1..i], l) =i (i deletions) EditDistance(l, S2[1..j]) =j (j insertions) In both coordiates: D(i,j) = min {D(i-1,j) +1, (S1[i] deleted) {D(i,j-1)+1, (S2[j] inserted) {D(i-1,j-1)+ D(i,j)} (substn or match) D(i,j) = 1, if S1[i] ¹ S2[j]; 0 otherwise. D(i-1,j-1) D(i-1,j) D(i,j-1) D(i,j) D(i,j) +1 +1 12/1/2018 ©Bud Mishra, 2001
Efficient Tabular Computation of Edit Distance Recursive Implementation a 2O(n+m)-time computation Bottom-up computation (n+1)£ (m+1) distinct values for D(i,j) to be computed Dynamic Programming Table of size (n+1)£ (m+1) String S_1 corresponds to the rows (Vertical Axis) String S_2 corresponds to the columns (Horizontal Axis) Fill out D(i,0) Ã First Column Fill out D(0,j) Ã First Row Fill out rows D(i,j) Ã Left-to-Right (increasing i) 12/1/2018 ©Bud Mishra, 2001
The Algorithm ©Bud Mishra, 2001 Time complexity = O(nm) for i=0 to n do D(i,0) Ã i; for j=0 to m do D(0,j) Ã j; for i=1 to n do for j=1 to m do D(i,j) Ã min[ D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1) + D(i,j)] ¤ Time complexity = O(nm) 12/1/2018 ©Bud Mishra, 2001
Example ©Bud Mishra, 2001 S2 ! S1 ! A T C G Solutions with minimal edit distance =4 (sol.1) T A-GGC AATCGG 1 2 3 4 (Edit Script= RMIRMR) (sol.2) - -TAGGC AATCGG - 1 2 3 4 (Edit Script = IIMRMMD) A T C G 1 2 3 4 5 6 S1 ! 12/1/2018 ©Bud Mishra, 2001
Trace Back ©Bud Mishra, 2001 Extracting Optimal Edit Transcript: Set a pointer from: Cell(i,j) ! Cell(i,j-1), if D(i,j) = D(i,j-1)+1 Horizontal Edge ) I, Insertion Cell(i,j) ! Cell(i-1,j), if D(i,j) = D(i-1,j)+1 Vertical Edge ) D, Deletion Cell(i,j) ! Cell(i-1,j-1), if D(i,j) = D(i-1,j-1)+D(i,j) Diagonal Edge ) R, Substitution, if D(i,j)=1 M. Match, if D(i,j) = 0. Optimal Edit Transcript can be computed in O(n+m) additional time. 12/1/2018 ©Bud Mishra, 2001
GAPS: The Scoring Model Basic operations: Sequencing Errors or Evolutionary processes of Mutations and Selections Substitution: Changes one base to another. Gaps: Insertions or Deletions: Adds or removes a base. Total Score Assigned to an Alignment= Sum of terms for each aligned pair of bases plus terms for each gap. 12/1/2018 ©Bud Mishra, 2001
Total Score of an Alignment with Gaps Total Score Assigned to an Alignment Corresponds to log of the Relative likelihood that the two sequences are related compared to being unrelated. Assumptions: Mutations or Sequencing Errors at different sites in a sequence occur independently. 12/1/2018 ©Bud Mishra, 2001
Substitution Matrices Notation: x and y ´ Pairs of sequences, |x| = n and |y| = m. x, y 2 (A+G+C+T)* xi = ith symbol in x yj = jth symbol in y Random Model, R: P(x,y|R) = Õ qxi Õ qyj qa = probability that the letter “a” occurs independently at a given site. 12/1/2018 ©Bud Mishra, 2001
Random Model vs. Alternative Model Alternative Model, M: P(x,y | M) = Õ pxi, yj pab = Probability that the letters “a” and “b” have each been derived independently from some common letter. Log-Odds Ratio (LOD): s(a,b) = ln (pab/qa qb) P(x,y|M)/P(x,y|R) = Õ (pxi yi/ qxi qyi) = Õ exp[s(xi,yi)] = exp[å s(xi, yi)] 12/1/2018 ©Bud Mishra, 2001
Score ©Bud Mishra, 2001 Score = ln [ P(x,y|M)/P(x,y|R) ] = å s(xi, yi) = s(x,y) Score Matrix or Substitution Matrix: A T C G 2 -1 Blossum 50 PAM (Point Accepted Mutation) 12/1/2018 ©Bud Mishra, 2001
1-PAM Matrix ©Bud Mishra, 2001 Let M be a probability transition matrix. Mab = Pr(a , b), a, b = chracters pa = Pr(“a” occurs in a string) fab = The number of times the mutation a , b was observed to occur. fa = åa ¹ b fab & f = å fa. K = 1-PAM Evolutionary distance “The amount of evolution that will change 1 in K characters on average.” 12/1/2018 ©Bud Mishra, 2001
Maa = 1 – ma, Mab = fab /(Kf pa) = (fab/fa) ma 1-PAM Matrix ma = fa /(Kf pa), Maa = 1 – ma, Mab = fab /(Kf pa) = (fab/fa) ma a-PAM Matrix = Ma M* = lima à 1 Ma Scorea(a,b) = 10 log10 Maab/pb Sequence comparison with 40 PAM, 120 PAM & 250 PAM score functions… 12/1/2018 ©Bud Mishra, 2001
Gaps: ©Bud Mishra, 2001 g = length of a gap, exponentially distributed » Exp(l) f(g) = l e-l g P(g) = f(g) Õ qxi ln P(g) = -l g + ln l + sum ln qxi = -d – (g-1) e (Affine Score Model) d = Gap-open Penalty e = Gap-Extension Penalty 12/1/2018 ©Bud Mishra, 2001
Multiple Sequence Alignment Defn:Given strings S1, S2, …Sk a multiple (global) alignment maps them to strings S’1, S’2, …, S’k (by inserting chosen spaces) such that |S’1| = |S’2| = L = |S’k|, and Removal of spaces from S’i contracts it to Si, for 1 5 i 5 k. 12/1/2018 ©Bud Mishra, 2001
Value of a Multiple Global Alignment The sum of pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all Ck,2 pairwise alignments induced by A. Given: Two strings S1 and S2. The expanded strings S’1 and S’2 correspond to a pairwise alignment. d(x, y) = distance between two characters x and y = 1, if x ¹ y and 0, if x = y. d(x,-) = d(-,y) = 1. Distance(S’1, S’2) = åi=1l d(S’1[i], S’2[i]), where l = |S’1|=|S’2|. 12/1/2018 ©Bud Mishra, 2001
Optimal Global Alignment An optimal SP(global) alignment of strings S1, S2, …, Sk is an alignment that has a minimum possible sum-of-pairs value for these strings among all possible multiple sequence alignments. 12/1/2018 ©Bud Mishra, 2001
Generalization of DP ©Bud Mishra, 2001 Assume |S1| = |S2| = L = |Sk| = n. The generalized k-dimensional DP table has (n+1)k entries. Each entry depends on 2k – 1 adjacent entries. D(i1, 0, …, 0) = i1 D(0, i2,…, 0) = i2 M D(0,0,…, ik) = ik D(i1, i2, …ik) = min; ¹ S µ {1..k} [ D[…, ij-1,…]j2 S + ål¹ m 2 S d(il, im) + |S| £ (n-|S|) ] 12/1/2018 ©Bud Mishra, 2001
Complexity ©Bud Mishra, 2001 The time and space complexity of the generalized DP solution of the multiple alignment problem is = O((2n)k) Theorem: The optimal SP alignment problem is NP-complete. In the worst-case, one cannot expect to do much better unless P=NP. 12/1/2018 ©Bud Mishra, 2001
P-Time Heuristics ©Bud Mishra, 2001 A Polynomial Time Approximate Algorithm for Multiple String Alignment: Assumption about the distance function: Triangle Inequality: 8chars, x ,y, z d(x,z) 5 d(x,y) + d(y,z) 8char, x d(x, x) = 0 D(S1. S2) , Value of the min. global alignment of S1 & S2. 12/1/2018 ©Bud Mishra, 2001
Algorithm ©Bud Mishra, 2001 Input: T = {S1, S2, …, Sk} Step 1: Find S1 2 CT that minimizes åS 2 T n {S1} D(S1, S) Time Complexity = O(k2 n2) Ck,2 DP each taking O(n2) time Call the remaining strings S2, …, Sk 12/1/2018 ©Bud Mishra, 2001
= O(k2 n2) + åi=1k O(i n2) = O(k2 n2) The ith Step Step i: Assume S1, …Si-1 have been aligned as S’1, …, S’i-1 Add Si: Run DP to align S’1 & Si a S”i and S’i Adjust S’2, …, S’i-1 by adding spaces where spaces were added in S’1 S1, S2, …, Si )aligned S’1, S’2, …, S’i. Length(S’1) in step i 5 i n. DP(S’1, Si) takes O(i n2) time Total time Complexity = O(k2 n2) + åi=1k O(i n2) = O(k2 n2) 12/1/2018 ©Bud Mishra, 2001
Competitiveness ©Bud Mishra, 2001 M = Alignment induced by the algorithm d(i,j) = Distance M induces on pair Si, Sj M* = Optimal alignment 2 SP(M) = åi=1k åj=1, j ¹ ik d(i,j) 5 åi=1k åj=1, j ¹ ik d(i,1) + d(1,j) (Triangle Inequality) = åi=1k åj=1, j ¹ ik d(1,i) + åi=1k åj=1, j ¹ ik d(1,j) (Symmetry) = åi=2k (k-1) d(1,i) +åj=2k (k-1) d(1,j) = 2(k-1) åi=2k d(1,i) 12/1/2018 ©Bud Mishra, 2001
Competitiveness ©Bud Mishra, 2001 SP(M) 5 2 (1 – 1/k) SP(M*) 2SP(M*) = åi=1k åj=1, j¹ ik D(Si, Sj) = åj=2k D(S1, Sj) + åj=1, j ¹ 2k D(S2, Sj) + L + åj=1k-1 D(Sk, Sj) = k åi=2k d(1,i) SP(M*) = (k/2) åi=2k d(1,i) ¸ (k/2) [SP(M)/(k-1)] SP(M) 5 2 (1 – 1/k) SP(M*) 12/1/2018 ©Bud Mishra, 2001