Computational Biology Lecture #7: Local Alignment

1 Computational Biology Lecture #7: Local Alignment
Bud Mishra Professor of Computer Science and Mathematics 11 ¦ 5 ¦ 2001 12/1/2018 ©Bud Mishra, 2001

2 Local Alignment Problem (LAP)
Finding substrings of high similarity: two given strings, $S_1$ and $S_2$, may have regions that are locally highly similar.

3 LAP Local Alignment Problem
Given: Two strings $S_1$ and $S_2$.
Find: Substrings $\alpha \sqsubseteq S_1$ and $\beta \sqsubseteq S_2$ whose similarity (in terms of an objective function, e.g., the optimal global alignment value) is maximum over all pairs of substrings of $S_1$ and $S_2$:
$v^* = \max_{\alpha \sqsubseteq S_1,\ \beta \sqsubseteq S_2} \mathrm{distance}(\alpha, \beta)$

4 Example
$S_1$ = p q r a x a b c s t v q, with $\alpha$ = a x a b c s $\sqsubseteq S_1$
$S_2$ = x y a x b a c s l l, with $\beta$ = a x b a c s $\sqsubseteq S_2$
Scores: $d(x,x) = 2$, $d(x,y) = -2$, $d(x,-) = d(-,x) = -1$.
Local Alignment:
a x a b - c s
| |   |   | |
a x - b a c s
$\mathrm{distance}(\alpha, \beta) = 8$ (five matches at 2 each, two spaces at $-1$ each).

5 Naïve Complexity
Note:
(1) Let $|S_1| = n$ and $|S_2| = m$.
Total number of substrings of $S_1$ = $\binom{n+1}{2} = O(n^2)$
Total number of substrings of $S_2$ = $\binom{m+1}{2} = O(m^2)$
Naïvely, $O(n^2 m^2)$ candidate pairs of substrings need to be globally aligned by a DP algorithm of complexity $O(|\alpha|\,|\beta|)$.
Complexity of the resulting algorithm = $O(n^3 m^3)$
(2) An improved algorithm (SWAT, Smith-Waterman) reduces the time complexity to $O(nm)$.
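The naïve procedure in note (1) can be sketched in Python as follows. This is only an illustration under assumed names (`global_score`, `substrings`, `naive_local_alignment` are not from the lecture), using the scores of the earlier example (match 2, mismatch -2, space -1).

```python
from itertools import product

def global_score(a, b, match=2, mismatch=-2, space=-1):
    """Value of the best global alignment of a and b (Needleman-Wunsch, two rows)."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * space]
        for j in range(1, len(b) + 1):
            d = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + d, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def substrings(s):
    """All non-empty substrings of s; there are C(len(s)+1, 2) of them."""
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

def naive_local_alignment(s1, s2):
    """Globally align every pair of substrings and keep the best value."""
    best = (0, "", "")  # a pair of empty substrings scores 0
    for a, b in product(substrings(s1), substrings(s2)):
        value = global_score(a, b)
        if value > best[0]:
            best = (value, a, b)
    return best
```

For the strings of the example slide, `naive_local_alignment("pqraxabcstvq", "xyaxbacsll")[0]` evaluates to 8, at the cost of aligning all 78 × 55 substring pairs.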

6 LSAP Local Suffix Alignment Problem
A restricted version of the LAP.
Given: Two strings $S_1$ and $S_2$ and two indices $i \le |S_1|$ and $j \le |S_2|$, with
$A_i = S_1[1..i]$, a prefix of $S_1$
$B_j = S_2[1..j]$, a prefix of $S_2$
Find: A suffix $\alpha = S_1[k..i]$ of $A_i$ (possibly empty, $\lambda$) and a suffix $\beta = S_2[l..j]$ of $B_j$ (possibly empty, $\lambda$) that maximize a linear objective function $V(\alpha, \beta)$ over all pairs of suffixes of $A_i$ and $B_j$.

7 Objective Function
$v(i,j) = \max_{\alpha = \mathrm{suf}\, S_1[1..i],\ \beta = \mathrm{suf}\, S_2[1..j]} V(\alpha, \beta)$ = value of the optimal local suffix alignment for the given index pair $i, j$.
$v^* = \max_{i \le n,\ j \le m} v(i,j)$ = value of the optimal local alignment, where $n = |S_1|$ and $m = |S_2|$.

8 Optimal Local Alignment Recurrence Equations
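$v^* = \max_{i \le n,\ j \le m} v(i,j)$. If the maximum is attained at $(i', j')$, then $v^* = v(i', j') = V(\alpha, \beta)$ with $\alpha = \mathrm{suf}\, S_1[1..i']$ and $\beta = \mathrm{suf}\, S_2[1..j']$.
Consider an optimal suffix alignment with $\alpha = \mathrm{suf}\, S_1[1..i]$ and $\beta = \mathrm{suf}\, S_2[1..j]$.
Case 1: $\alpha = \beta = \lambda$ (the empty string).
Base: $V(\alpha, \beta) = 0$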

9 Optimal Local Alignment Recurrence Equations
Case 2: $\alpha \ne \lambda$, $\alpha = \alpha' \cdot S_1[i]$, and $S_1[i]$ matches "-":
Ind(A): $V(\alpha, \beta) = V(\alpha', \beta) + d(S_1[i], -)$
…or $S_1[i]$ matches $S_2[j]$ ($\beta = \beta' \cdot S_2[j]$):
Ind(C): $V(\alpha, \beta) = V(\alpha', \beta') + d(S_1[i], S_2[j])$

10 Optimal Local Alignment Recurrence Equations
Case 3: $\beta \ne \lambda$, $\beta = \beta' \cdot S_2[j]$, and $S_2[j]$ matches "-":
Ind(B): $V(\alpha, \beta) = V(\alpha, \beta') + d(-, S_2[j])$
…or $S_1[i]$ matches $S_2[j]$ ($\alpha = \alpha' \cdot S_1[i]$):
Ind(C): $V(\alpha, \beta) = V(\alpha', \beta') + d(S_1[i], S_2[j])$

11 Recurrence Equation
$v(i,j) = \max_{\alpha = \mathrm{suf}\, S_1[1..i],\ \beta = \mathrm{suf}\, S_2[1..j]} V(\alpha, \beta)$
Base: for $i = 0 \lor j = 0$: $v(i,j) = 0$, i.e. $v(0,0) = v(i,0) = v(0,j) = 0$.
Induction: for $i > 0 \land j > 0$:
$v(i,j) = \max[\,0,\ v(i-1,j) + d(S_1[i], -),\ v(i,j-1) + d(-, S_2[j]),\ v(i-1,j-1) + d(S_1[i], S_2[j])\,]$
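A minimal Python sketch of this recurrence, assuming the match/mismatch/space scores of the earlier example and a hypothetical function name. It keeps only two rows of the table, which is enough to obtain $v^*$ but not the alignment itself.

```python
def local_alignment_value(s1, s2, match=2, mismatch=-2, space=-1):
    """Return v* = max over all (i, j) of v(i, j)."""
    prev = [0] * (len(s2) + 1)      # row i-1; base case v(0, j) = 0
    best = 0
    for i in range(1, len(s1) + 1):
        cur = [0]                   # base case v(i, 0) = 0
        for j in range(1, len(s2) + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            cur.append(max(0,
                           prev[j] + space,       # S1[i] against a space
                           cur[j - 1] + space,    # S2[j] against a space
                           prev[j - 1] + d))      # S1[i] against S2[j]
            best = max(best, cur[j])
        prev = cur
    return best
```

The `0` inside `max` is what distinguishes the local recurrence from the global one: a suffix alignment may always be restarted from the empty pair.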

12 Dynamic Programming Table (with Traceback)
Compute all $v(i,j)$ entries: Complexity = $O(nm)$
Find $v^* = v(i^*, j^*)$ by finding the largest value in any cell: Complexity = $O(nm)$
Trace the pointers back from $v(i^*, j^*)$ until a cell is reached with value $v(i', j') = 0$: Complexity = $O(n+m)$
Results: $\alpha = S_1[i'+1..i^*] \sqsubseteq S_1$ and $\beta = S_2[j'+1..j^*] \sqsubseteq S_2$ (the zero-valued cell itself contributes no characters)
Total Complexity = $O(nm) = O(|S_1|\,|S_2|)$
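The three steps (fill, find the maximum, trace back) might look like this in Python. It is a sketch with assumed names and scores; instead of storing pointers it re-derives them during traceback by checking which predecessor produced each value.

```python
def smith_waterman(s1, s2, match=2, mismatch=-2, space=-1):
    """Return (v*, aligned row for s1, aligned row for s2)."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):                      # step 1: fill the table
        for j in range(1, m + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0, v[i - 1][j] + space,
                          v[i][j - 1] + space, v[i - 1][j - 1] + d)
            if v[i][j] > best:                     # step 2: remember the maximum
                best, bi, bj = v[i][j], i, j
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and v[i][j] > 0:         # step 3: trace back to a 0 cell
        d = match if s1[i - 1] == s2[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + d:
            top.append(s1[i - 1]); bottom.append(s2[j - 1]); i -= 1; j -= 1
        elif v[i][j] == v[i - 1][j] + space:
            top.append(s1[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

Several alignments can share the optimal value; which one is reported depends on the order in which the predecessors are checked.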

13 Example
[Figure: dynamic programming table with traceback arrows (← and ↑). Row and column labels include the characters p q r … t v and x y a b c s l; the cell values are not recoverable from the transcript.]

14 Dealing with Gaps
A gap is any “maximal consecutive run of spaces” in a single string of a given alignment.
c t t t a a c - - a - a c
c - - - c a c c c a t - c
The alignment contains four gaps, labeled $g_1$ to $g_4$ on the slide.

15 Gaps
Initial Gap: A gap may be bordered on the right by the first character of a string.
Final Gap: A gap may be bordered on the left by the last character of a string.
Internal Gap: A gap may be bordered on both left and right.
Simple Gap Penalty Model → Constant Weight, $W_g$
Each gap contributes a constant penalty = $W_g$
$d(x,x) = 2$, $d(x,y) = -2$, $d(x,-) = d(-,y) = 0$
Let # gaps = $k$. Then the value of an alignment = $\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - k\,W_g$
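To make the gap definitions concrete, here is a small Python sketch (function names are assumptions) that finds the gaps in one row of an alignment, labels each as initial, final, or internal, and evaluates an alignment under the constant model, where spaces are free and each gap costs $W_g$.

```python
import re

def gaps(row):
    """Return (start, length, kind) for every maximal run of '-' in one row."""
    found = []
    for m in re.finditer(r"-+", row):
        if m.start() == 0:
            kind = "initial"
        elif m.end() == len(row):
            kind = "final"
        else:
            kind = "internal"
        found.append((m.start(), m.end() - m.start(), kind))
    return found

def constant_gap_value(row1, row2, w_g, match=2, mismatch=-2):
    """Sum of d over aligned columns minus k * W_g, with d(x,-) = d(-,y) = 0."""
    assert len(row1) == len(row2)
    total = 0
    for x, y in zip(row1, row2):
        if x != "-" and y != "-":
            total += match if x == y else mismatch
    k = len(gaps(row1)) + len(gaps(row2))
    return total - k * w_g
```

For instance, `gaps("--ac-g-")` reports one initial, one internal, and one final gap.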

16 Biological Motivations for Gap Models
Unequal crossing-over in meiosis
DNA slippage during replication
Insertion of transposable elements (“jumping genes”)
Insertion by retroviruses
Translocation between chromosomes
Examples of alignment with gaps:
cDNA matching problem
Processed pseudogene problem

17 Gap Weights
Constant: Each gap has a penalty of $W_g$. Each space is free: $d(x,-) = d(-,x) = 0$.
Affine: Gap initiation weight = $W_g$, gap extension weight = $W_s$. Each gap of length $q$ has a penalty of $W_g + q\,W_s$.
Convex: Each gap of length $q$ has a penalty of $W_g + W_s \ln q$.
Arbitrary: Each gap of length $q$ has a penalty of $W_g + w(q)\,W_s$, where $w(q)$ = arbitrary function.

18 General Model
Arbitrary: Each gap of length $q$ has a penalty of $W_g + w(q)\,W_s$, where $w(q)$ = arbitrary function:
$w(q) = 0$ → constant
$w(q) = q$ → linear/affine
$w(q) = \ln q$ → convex
Total cost under the constant model: $\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\text{gaps})\,W_g$
Total cost under the affine model: $\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\text{gaps})\,W_g - (\#\text{spaces})\,W_s$
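The general model can be evaluated for a given alignment by charging $W_g + w(q)\,W_s$ per gap. The sketch below assumes spaces contribute nothing through $d$ (all gap cost comes from the penalty term) and uses illustrative names and match/mismatch scores.

```python
import math
from itertools import groupby

def gap_lengths(row):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(list(run)) for ch, run in groupby(row) if ch == "-"]

def alignment_value(row1, row2, w, w_g, w_s, match=2, mismatch=-2):
    """Column scores minus (W_g + w(q) * W_s) for every gap of length q."""
    total = sum(match if x == y else mismatch
                for x, y in zip(row1, row2) if x != "-" and y != "-")
    for q in gap_lengths(row1) + gap_lengths(row2):
        total -= w_g + w(q) * w_s
    return total

def constant(q):   # w(q) = 0
    return 0

def affine(q):     # w(q) = q
    return q

def convex(q):     # w(q) = ln q
    return math.log(q)
```

With `affine`, the total penalty equals (#gaps) $W_g$ + (#spaces) $W_s$, matching the affine total-cost formula.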

19 Local Alignment under Arbitrary Gap Weight Model
Dynamic Programming (Needleman & Wunsch): Given two strings $S_1$ and $S_2$, start by aligning the prefixes $S_{1,i} = S_1[1..i]$ and $S_{2,j} = S_2[1..j]$. There are three different cases to consider…

20 Case 1
$S_1[i]$ is aligned to a character strictly to the left of character $S_2[j]$.
[Figure: prefixes $S_{1,i}$ and $S_{2,j}$ with the positions of $S_1[i]$ and $S_2[j]$ marked.]

21 Case 2
$S_1[i]$ is aligned to a character strictly to the right of character $S_2[j]$.
[Figure: prefixes $S_{1,i}$ and $S_{2,j}$ with the positions of $S_1[i]$ and $S_2[j]$ marked.]

22 Case 3
$S_1[i]$ and $S_2[j]$ are aligned opposite each other:
Subcase A: $S_1[i] = S_2[j]$
Subcase B: $S_1[i] \ne S_2[j]$
[Figure: prefixes $S_{1,i}$ and $S_{2,j}$ with the positions of $S_1[i]$ and $S_2[j]$ marked.]

23 Auxiliary Variables
$XL(i,j) = \max_{\text{alignments for case 1}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
$XR(i,j) = \max_{\text{alignments for case 2}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
$XS(i,j) = \max_{\text{alignments for case 3}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
$V(i,j) = \max(XL(i,j), XR(i,j), XS(i,j))$

24 Recurrence: Base
Notation: ? means “undefined”.
$XS(0,0) = 0$, $XS(i,0) = ?$, $XS(0,j) = ?$
$XL(0,0) = ?$, $XL(i,0) = -w(i)$, $XL(0,j) = ?$
$XR(0,0) = ?$, $XR(i,0) = ?$, $XR(0,j) = -w(j)$
$V(0,0) = 0$, $V(i,0) = -w(i)$, $V(0,j) = -w(j)$

25 Recurrence: Induction
For $i > 0$ and $j > 0$:
$XS(i,j) = V(i-1,j-1) + d(S_1[i], S_2[j])$
$XL(i,j) = \max_{0 \le k \le j-1} (V(i,k) - w(j-k))$
$XR(i,j) = \max_{0 \le l \le i-1} (V(l,j) - w(i-l))$
$V(i,j) = \max(XL(i,j), XR(i,j), XS(i,j))$
Each $V(i,j)$ can be computed in time $O(i+j)$.
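A direct Python transcription of this induction, offered as a sketch: the function name is hypothetical, `w` is the gap weight as a function of gap length, and `d` scores a pair of characters. Only the $V$ boundary values are needed, because $XL$ and $XR$ are computed from $V$ alone.

```python
def align_arbitrary_gaps(s1, s2, w, d):
    """Value V(n, m) of the best global alignment under gap weight w(q)."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)                 # S1[1..i] against one gap
    for j in range(1, m + 1):
        V[0][j] = -w(j)                 # S2[1..j] against one gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            xs = V[i - 1][j - 1] + d(s1[i - 1], s2[j - 1])
            xl = max(V[i][k] - w(j - k) for k in range(j))   # 0 <= k <= j-1
            xr = max(V[l][j] - w(i - l) for l in range(i))   # 0 <= l <= i-1
            V[i][j] = max(xs, xl, xr)
    return V[n][m]
```

As a usage example, with `d = lambda x, y: 2 if x == y else -2` and `w = lambda q: 2 + q`, aligning `"ACGT"` with `"AT"` gives 0: two matches (4) minus one gap of length 2 (4). The two inner `max` scans are the source of the $O(i+j)$ cost per cell.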

26 Total Time Complexity
Let $|S_1| = n$ and $|S_2| = m$.
The recurrence can be evaluated with a dynamic programming table of space complexity $O(nm)$ and in time complexity $O(n^2 m + m^2 n)$.

27 Affine Gap Model-Recurrence
SWAT: Smith-Waterman
Modifying the recurrence equations for the affine case:
$XS(0,0) = 0$, $XS(i,0) = ?$, $XS(0,j) = ?$
$XL(0,0) = ?$, $XL(i,0) = -W_g - i\,W_s$, $XL(0,j) = ?$
$XR(0,0) = ?$, $XR(i,0) = ?$, $XR(0,j) = -W_g - j\,W_s$
$V(0,0) = 0$, $V(i,0) = -W_g - i\,W_s$, $V(0,j) = -W_g - j\,W_s$

28 Recurrence: Induction
For $i > 0$ and $j > 0$:
$XS(i,j) = V(i-1,j-1) + d(S_1[i], S_2[j])$
$XL(i,j) = \max(XL(i,j-1) - W_s,\ ?,\ XS(i,j-1) - W_g - W_s,\ V(i,j-1) - W_g - W_s) = \max[XL(i,j-1),\ V(i,j-1) - W_g] - W_s$
$XR(i,j) = \max(?,\ XR(i-1,j) - W_s,\ XS(i-1,j) - W_g - W_s,\ V(i-1,j) - W_g - W_s) = \max[XR(i-1,j),\ V(i-1,j) - W_g] - W_s$
$V(i,j) = \max(XL(i,j), XR(i,j), XS(i,j))$
Each $V(i,j)$ can be computed in $O(1)$ time.
The optimal alignment with affine gap weights can be computed with a DP table of space and time complexity $O(nm)$.
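The simplified affine recurrences translate into a constant-time update per cell. The Python sketch below uses assumed names and scores and makes one choice that is not taken from the slides: every undefined (?) boundary entry of $XL$ and $XR$ is represented by $-\infty$, a common convention that prevents a gap in one string from being extended as if it were a gap in the other.

```python
NEG_INF = float("-inf")   # stands in for the "undefined" (?) entries

def align_affine(s1, s2, w_g, w_s, match=2, mismatch=-2):
    """Value V(n, m) of the best global alignment with gap cost W_g + q * W_s."""
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    XL = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    XR = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -w_g - i * w_s
    for j in range(1, m + 1):
        V[0][j] = -w_g - j * w_s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            xs = V[i - 1][j - 1] + d
            # extend an open gap (pay W_s) or open a new one (pay W_g + W_s)
            XL[i][j] = max(XL[i][j - 1], V[i][j - 1] - w_g) - w_s
            XR[i][j] = max(XR[i - 1][j], V[i - 1][j] - w_g) - w_s
            V[i][j] = max(xs, XL[i][j], XR[i][j])
    return V[n][m]
```

With `w_g=2` and `w_s=1`, `align_affine("ACGT", "AT", 2, 1)` gives 0, the same value as the arbitrary-weight recurrence with $w(q) = 2 + q$, but without the linear scan per cell.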

29 Parallelization
Systolic Arrays:
Create a special-purpose processor $P(i,j)$ for the $(i,j)$th entry of the dynamic programming table.
Connect $P(i,j)$ to $P(i-1,j)$, $P(i-1,j-1)$ and $P(i,j-1)$.
Each processor holds static data: $W_g$ and $W_s$.
Each processor stores and transmits dynamic data: $XS(i,j)$, $XL(i,j)$, $XR(i,j)$ and $V(i,j)$.

30 Systolic Computation
Dynamically compute in one cycle: $XS(i,j)$, $XL(i,j)$, $XR(i,j)$, $V(i,j)$
using
$XS(i-1,j)$, $XL(i-1,j)$, $XR(i-1,j)$, $V(i-1,j)$
$XS(i,j-1)$, $XL(i,j-1)$, $XR(i,j-1)$, $V(i,j-1)$
$XS(i-1,j-1)$, $XL(i-1,j-1)$, $XR(i-1,j-1)$, $V(i-1,j-1)$
and $W_g$ & $W_s$.
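Because each cell depends only on its upper, left, and upper-left neighbours, all cells on one anti-diagonal ($i + j$ constant) can be computed in the same cycle. The Python sketch below simulates that schedule sequentially. For brevity it uses the simpler local-alignment recurrence with a per-space penalty rather than the affine quantities on the slide, and its name and scores are assumptions.

```python
def wavefront_local_alignment(s1, s2, match=2, mismatch=-2, space=-1):
    """Fill the table one anti-diagonal per simulated cycle; return (v*, cycles)."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    cycles = 0
    for t in range(2, n + m + 1):                      # cells with i + j == t
        # The iterations of this inner loop are independent of one another:
        # they only read cells from anti-diagonals t-1 and t-2.
        for i in range(max(1, t - m), min(n, t - 1) + 1):
            j = t - i
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0, v[i - 1][j] + space,
                          v[i][j - 1] + space, v[i - 1][j - 1] + d)
        cycles += 1
    return max(max(row) for row in v), cycles
```

With one processor per cell, the $nm$ entries are finished after $n + m - 1$ cycles, for example 21 cycles for strings of lengths 12 and 10.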

31 Database Search
A query search $\Rightarrow$ compare the query sequence to all the sequences in the database for local similarities.
Heuristics (BLAST and its relatives):
BLAST
FAST
Needs good complexity analysis.

32 BLAST
Basic Local Alignment Search Tool
Query sequence $s \in \Sigma^*$; database $L \subseteq \Sigma^*$.
BLAST returns a list of high-scoring segment pairs between the query sequence and sequences in the database.
The score function depends on $\alpha$-PAM score functions.

33 BLAST Heuristics
BLAST is a 3-step algorithm:
Step 1. Compile a list of high-scoring strings: $\mathcal{W}$ = words. $\mathcal{W}$ = all w-mers that score at least $Q$ with some w-mer of the query.
Step 2. Search for hits. Each hit defines a seed. Construct a DFA to recognize $\mathcal{W}$. Scan the database, compiling the hits.
Step 3. Extend the seeds. The seeds are extended in both directions until the score falls a certain distance below the best so far.
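The three steps can be illustrated with a toy Python sketch. This is not the BLAST implementation: it assumes a simple match/mismatch score instead of PAM scores, a dictionary lookup in place of the DFA, ungapped extension only, and hypothetical names and parameters (`word_table`, `find_seeds`, `extend`, `drop`).

```python
from itertools import product

def score(u, v, match=2, mismatch=-2):
    """Ungapped score of two equal-length strings."""
    return sum(match if a == b else mismatch for a, b in zip(u, v))

def word_table(query, w, threshold, alphabet="ACGT"):
    """Step 1: map each high-scoring w-mer to the query positions it matches."""
    table = {}
    for cand in map("".join, product(alphabet, repeat=w)):
        for i in range(len(query) - w + 1):
            if score(cand, query[i:i + w]) >= threshold:
                table.setdefault(cand, []).append(i)
    return table

def find_seeds(table, target, w):
    """Step 2: scan the target; every table hit is a seed (query_pos, target_pos)."""
    return [(i, j) for j in range(len(target) - w + 1)
            for i in table.get(target[j:j + w], [])]

def extend(query, target, i, j, w, drop=4):
    """Step 3: extend a seed both ways until the score falls `drop` below the best."""
    best = cur = score(query[i:i + w], target[j:j + w])
    right = k = 0
    while i + w + k < len(query) and j + w + k < len(target):
        cur += score(query[i + w + k], target[j + w + k])
        k += 1
        if cur > best:
            best, right = cur, k
        elif best - cur > drop:
            break
    cur, left, k = best, 0, 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        cur += score(query[i - k - 1], target[j - k - 1])
        k += 1
        if cur > best:
            best, left = cur, k
        elif best - cur > drop:
            break
    return best, i - left, j - left, left + w + right   # score, starts, length
```

For example, with query `"GATTACA"`, target `"CCGATTACACC"`, `w=3` and `threshold=6`, the first seed `(0, 2)` extends to the full 7-character match with score 14.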

34 FAST
$s, t$ = two sequences being compared, with $|s| = m$ and $|t| = n$.
Step 1. Determine k-tuples common to both sequences (k = 1 or 2).
Step 2. The “offset” of a common k-tuple is computed. If the common k-tuple starts at positions $s[i]$ and $t[j]$, then offset = $i - j$.
Step 3. Determine the most common offset value to align the sequences.
Step 4. Combine the common k-tuples to create a region.

35 Example
s = H A R F Y A A Q I V L
t = V D M A A Q I A
Positions of 1-tuples in s: A → (2, 6, 7), F → (4), H → (1), I → (9), L → (11), Q → (8), R → (3), V → (10), Y → (5)
Offsets ($i - j$) for the 1-tuples of t: V: {9}, D: none, M: none, A: {-2, 2, 3}, A: {-3, 1, 2}, Q: {2}, I: {2}, A: {-6, -2, -1}
The most common offset is 2.
Alignment:
H A R F Y A A Q I V L
          | | | | +
    V D M A A Q I A
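Steps 1 to 3 of FAST amount to building a histogram of offsets. A minimal Python sketch, with an assumed function name and 1-based positions as on the slides:

```python
from collections import Counter, defaultdict

def offset_histogram(s, t, k=1):
    """Count the offsets i - j of all k-tuples common to s and t (1-based i, j)."""
    positions = defaultdict(list)
    for i in range(len(s) - k + 1):
        positions[s[i:i + k]].append(i + 1)      # where each k-tuple starts in s
    hist = Counter()
    for j in range(len(t) - k + 1):
        for i in positions.get(t[j:j + k], []):
            hist[i - (j + 1)] += 1
    return hist
```

For the example sequences, `offset_histogram("HARFYAAQIVL", "VDMAAQIA").most_common(1)` returns `[(2, 4)]`: offset 2 is supported by four common 1-tuples (A, A, Q, I).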
