Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computational Biology Lecture #6: Matching and Alignment

Similar presentations


Presentation on theme: "Computational Biology Lecture #6: Matching and Alignment"— Presentation transcript:

1 Computational Biology Lecture #6: Matching and Alignment
Bud Mishra Professor of Computer Science and Mathematics 10 ¦ 22 ¦ 2001 12/1/2018 ©Bud Mishra, 2001

2 Inexact Matching ©Bud Mishra, 2001 Example: Edit Distance Problem:
Edit distance between two biological sequences May correspond to: Evolutionary Distance Functional Distance Structural Distance 12/1/2018 ©Bud Mishra, 2001

3 Edit Distance ©Bud Mishra, 2001
Simplest distance function corresponds to: EDIT DISTANCE Atomic Edit Functions: Insertion AATCGG a AATACGG Deletion AATACGG a AATCGG Substitution AATCGG a AATAGG A composite edit function ' Function Composition of Atomic Edit Functions 12/1/2018 ©Bud Mishra, 2001

4 Cost of a Composite Edit Function
(Based on the cost or distance for Atomic Edit Functions) Given: Two strings S1 and S2 Distance(S1, S2) = min { cost(E) | E(S1) = S2} where E = composite edit function mapping S1 to S2. 12/1/2018 ©Bud Mishra, 2001

5 Some Properties of Distance Function
Assume: (8 e= Atomic Edit Function) cost(e) = cost(e-1) Distance(S1, S2) = Distance(S2, S1) Symmetric Distance(S1, S1) = 0 Distance(S1, S2) + Distance(S2, S3) 5 Distance(S1, S3) Triangle Inequality Simplest Cost Function: Each atomic edit function is of unit cost. 12/1/2018 ©Bud Mishra, 2001

6 Edit Operations ©Bud Mishra, 2001
I: Insertion of a character into the first string S1 D: Deletion of a character from the first string S1 R: Replacement (or Substitution) of a character in the first string S1 with a character in the second string S2 M: Matching (Identity) 12/1/2018 ©Bud Mishra, 2001

7 Edit Transcript ©Bud Mishra, 2001 Example: GATT A CA | | | | |
The complete edit function is described by an “edit transcript” EDIT TRANSCRIPT = s 2 {D, M, R, I}* Example (in left): Edit transcript = DDMIRIMMII Edit Distance = =7 GATT A CA | | | | | TTCGGCATT D 1 2 MM I 3 R 4 5 6 7 12/1/2018 ©Bud Mishra, 2001

8 Levenshtein Distance or Edit Distance
Edit Distance between two strings S1 and S2 is defined as the minimum number of atomic edit operations – insertions, deletions (indels), and substitutions – needed to transform the first string into the second Optimal Transcript = An edit transcript corresponding to the minimum number of atomic edit operations of unit cost. EDP The Edit Distance Problem is to compute the edit distance between two given strings, along with an optimal edit transcript that describes the transformation (alignment) 12/1/2018 ©Bud Mishra, 2001

9 Dynamic Programming Calculation of Edit Distance
Define: D(i,j) ´ Min number of atomic edit operations needed to transform the first i characters of S1 into the first j characters of S2 ´ EditDistance(S1[1..i], S2[1..j]) |S1| = n |S2| = m Distance(S1, S2) = D(n,m) Dynamic Programming: 3 components: Recurrence Relation Tabular Computation Traceback 12/1/2018 ©Bud Mishra, 2001

10 Recurrence ©Bud Mishra, 2001 D(i-1,j-1) D(i-1,j) D(i,j-1) D(i,j)
Base Relation: D(0,0) = 0 EditDistance(l,l) = 0 Recurrence Relations: In 1 coordinate: D(i,0) = D(i-1,0)+1 (S1[i] deleted) D(0,j) = D(0,j-1)+1 (S2[j] inserted) EditDistance(S1[1..i], l) =i (i deletions) EditDistance(l, S2[1..j]) =j (j insertions) In both coordiates: D(i,j) = min {D(i-1,j) +1, (S1[i] deleted) {D(i,j-1)+1, (S2[j] inserted) {D(i-1,j-1)+ D(i,j)} (substn or match) D(i,j) = 1, if S1[i] ¹ S2[j]; 0 otherwise. D(i-1,j-1) D(i-1,j) D(i,j-1) D(i,j) D(i,j) +1 +1 12/1/2018 ©Bud Mishra, 2001

11 Efficient Tabular Computation of Edit Distance
Recursive Implementation a 2O(n+m)-time computation Bottom-up computation (n+1)£ (m+1) distinct values for D(i,j) to be computed Dynamic Programming Table of size (n+1)£ (m+1) String S_1 corresponds to the rows (Vertical Axis) String S_2 corresponds to the columns (Horizontal Axis) Fill out D(i,0) Ã First Column Fill out D(0,j) Ã First Row Fill out rows D(i,j) Ã Left-to-Right (increasing i) 12/1/2018 ©Bud Mishra, 2001

12 The Algorithm ©Bud Mishra, 2001 Time complexity = O(nm)
for i=0 to n do D(i,0) Ã i; for j=0 to m do D(0,j) Ã j; for i=1 to n do for j=1 to m do D(i,j) Ã min[ D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1) + D(i,j)] Time complexity = O(nm) 12/1/2018 ©Bud Mishra, 2001

13 Example ©Bud Mishra, 2001 S2 ! S1 ! A T C G Solutions with minimal
edit distance =4 (sol.1) T A-GGC AATCGG (Edit Script= RMIRMR) (sol.2) - -TAGGC AATCGG - (Edit Script = IIMRMMD) A T C G 1 2 3 4 5 6 S1 ! 12/1/2018 ©Bud Mishra, 2001

14 Trace Back ©Bud Mishra, 2001 Extracting Optimal Edit Transcript:
Set a pointer from: Cell(i,j) ! Cell(i,j-1), if D(i,j) = D(i,j-1)+1 Horizontal Edge ) I, Insertion Cell(i,j) ! Cell(i-1,j), if D(i,j) = D(i-1,j)+1 Vertical Edge ) D, Deletion Cell(i,j) ! Cell(i-1,j-1), if D(i,j) = D(i-1,j-1)+D(i,j) Diagonal Edge ) R, Substitution, if D(i,j)=1 M. Match, if D(i,j) = 0. Optimal Edit Transcript can be computed in O(n+m) additional time. 12/1/2018 ©Bud Mishra, 2001

15 GAPS: The Scoring Model
Basic operations: Sequencing Errors or Evolutionary processes of Mutations and Selections Substitution: Changes one base to another. Gaps: Insertions or Deletions: Adds or removes a base. Total Score Assigned to an Alignment= Sum of terms for each aligned pair of bases plus terms for each gap. 12/1/2018 ©Bud Mishra, 2001

16 Total Score of an Alignment with Gaps
Total Score Assigned to an Alignment Corresponds to log of the Relative likelihood that the two sequences are related compared to being unrelated. Assumptions: Mutations or Sequencing Errors at different sites in a sequence occur independently. 12/1/2018 ©Bud Mishra, 2001

17 Substitution Matrices
Notation: x and y ´ Pairs of sequences, |x| = n and |y| = m. x, y 2 (A+G+C+T)* xi = ith symbol in x yj = jth symbol in y Random Model, R: P(x,y|R) = Õ qxi Õ qyj qa = probability that the letter “a” occurs independently at a given site. 12/1/2018 ©Bud Mishra, 2001

18 Random Model vs. Alternative Model
Alternative Model, M: P(x,y | M) = Õ pxi, yj pab = Probability that the letters “a” and “b” have each been derived independently from some common letter. Log-Odds Ratio (LOD): s(a,b) = ln (pab/qa qb) P(x,y|M)/P(x,y|R) = Õ (pxi yi/ qxi qyi) = Õ exp[s(xi,yi)] = exp[å s(xi, yi)] 12/1/2018 ©Bud Mishra, 2001

19 Score ©Bud Mishra, 2001 Score = ln [ P(x,y|M)/P(x,y|R) ]
= å s(xi, yi) = s(x,y) Score Matrix or Substitution Matrix: A T C G 2 -1 Blossum 50 PAM (Point Accepted Mutation) 12/1/2018 ©Bud Mishra, 2001

20 1-PAM Matrix ©Bud Mishra, 2001
Let M be a probability transition matrix. Mab = Pr(a , b), a, b = chracters pa = Pr(“a” occurs in a string) fab = The number of times the mutation a , b was observed to occur. fa = åa ¹ b fab & f = å fa. K = 1-PAM Evolutionary distance “The amount of evolution that will change 1 in K characters on average.” 12/1/2018 ©Bud Mishra, 2001

21 Maa = 1 – ma, Mab = fab /(Kf pa) = (fab/fa) ma
1-PAM Matrix ma = fa /(Kf pa), Maa = 1 – ma, Mab = fab /(Kf pa) = (fab/fa) ma a-PAM Matrix = Ma M* = lima à 1 Ma Scorea(a,b) = 10 log10 Maab/pb Sequence comparison with 40 PAM, 120 PAM & 250 PAM score functions… 12/1/2018 ©Bud Mishra, 2001

22 Gaps: ©Bud Mishra, 2001 g = length of a gap,
exponentially distributed » Exp(l) f(g) = l e-l g P(g) = f(g) Õ qxi ln P(g) = -l g + ln l + sum ln qxi = -d – (g-1) e (Affine Score Model) d = Gap-open Penalty e = Gap-Extension Penalty 12/1/2018 ©Bud Mishra, 2001

23 Multiple Sequence Alignment
Defn:Given strings S1, S2, …Sk a multiple (global) alignment maps them to strings S’1, S’2, …, S’k (by inserting chosen spaces) such that |S’1| = |S’2| = L = |S’k|, and Removal of spaces from S’i contracts it to Si, for 1 5 i 5 k. 12/1/2018 ©Bud Mishra, 2001

24 Value of a Multiple Global Alignment
The sum of pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all Ck,2 pairwise alignments induced by A. Given: Two strings S1 and S2. The expanded strings S’1 and S’2 correspond to a pairwise alignment. d(x, y) = distance between two characters x and y = 1, if x ¹ y and 0, if x = y. d(x,-) = d(-,y) = 1. Distance(S’1, S’2) = åi=1l d(S’1[i], S’2[i]), where l = |S’1|=|S’2|. 12/1/2018 ©Bud Mishra, 2001

25 Optimal Global Alignment
An optimal SP(global) alignment of strings S1, S2, …, Sk is an alignment that has a minimum possible sum-of-pairs value for these strings among all possible multiple sequence alignments. 12/1/2018 ©Bud Mishra, 2001

26 Generalization of DP ©Bud Mishra, 2001
Assume |S1| = |S2| = L = |Sk| = n. The generalized k-dimensional DP table has (n+1)k entries. Each entry depends on 2k – 1 adjacent entries. D(i1, 0, …, 0) = i1 D(0, i2,…, 0) = i2 M D(0,0,…, ik) = ik D(i1, i2, …ik) = min; ¹ S µ {1..k} [ D[…, ij-1,…]j2 S + ål¹ m 2 S d(il, im) + |S| £ (n-|S|) ] 12/1/2018 ©Bud Mishra, 2001

27 Complexity ©Bud Mishra, 2001
The time and space complexity of the generalized DP solution of the multiple alignment problem is = O((2n)k) Theorem: The optimal SP alignment problem is NP-complete. In the worst-case, one cannot expect to do much better unless P=NP. 12/1/2018 ©Bud Mishra, 2001

28 P-Time Heuristics ©Bud Mishra, 2001
A Polynomial Time Approximate Algorithm for Multiple String Alignment: Assumption about the distance function: Triangle Inequality: 8chars, x ,y, z d(x,z) 5 d(x,y) + d(y,z) 8char, x d(x, x) = 0 D(S1. S2) , Value of the min. global alignment of S1 & S2. 12/1/2018 ©Bud Mishra, 2001

29 Algorithm ©Bud Mishra, 2001 Input: T = {S1, S2, …, Sk}
Step 1: Find S1 2 CT that minimizes åS 2 T n {S1} D(S1, S) Time Complexity = O(k2 n2) Ck,2 DP each taking O(n2) time Call the remaining strings S2, …, Sk 12/1/2018 ©Bud Mishra, 2001

30 = O(k2 n2) + åi=1k O(i n2) = O(k2 n2)
The ith Step Step i: Assume S1, …Si-1 have been aligned as S’1, …, S’i-1 Add Si: Run DP to align S’1 & Si a S”i and S’i Adjust S’2, …, S’i-1 by adding spaces where spaces were added in S’1 S1, S2, …, Si )aligned S’1, S’2, …, S’i. Length(S’1) in step i 5 i n. DP(S’1, Si) takes O(i n2) time Total time Complexity = O(k2 n2) + åi=1k O(i n2) = O(k2 n2) 12/1/2018 ©Bud Mishra, 2001

31 Competitiveness ©Bud Mishra, 2001
M = Alignment induced by the algorithm d(i,j) = Distance M induces on pair Si, Sj M* = Optimal alignment 2 SP(M) = åi=1k åj=1, j ¹ ik d(i,j) 5 åi=1k åj=1, j ¹ ik d(i,1) + d(1,j) (Triangle Inequality) = åi=1k åj=1, j ¹ ik d(1,i) + åi=1k åj=1, j ¹ ik d(1,j) (Symmetry) = åi=2k (k-1) d(1,i) +åj=2k (k-1) d(1,j) = 2(k-1) åi=2k d(1,i) 12/1/2018 ©Bud Mishra, 2001

32 Competitiveness ©Bud Mishra, 2001 SP(M) 5 2 (1 – 1/k) SP(M*)
2SP(M*) = åi=1k åj=1, j¹ ik D(Si, Sj) = åj=2k D(S1, Sj) + åj=1, j ¹ 2k D(S2, Sj) + L + åj=1k-1 D(Sk, Sj) = k åi=2k d(1,i) SP(M*) = (k/2) åi=2k d(1,i) ¸ (k/2) [SP(M)/(k-1)] SP(M) 5 2 (1 – 1/k) SP(M*) 12/1/2018 ©Bud Mishra, 2001


Download ppt "Computational Biology Lecture #6: Matching and Alignment"

Similar presentations


Ads by Google