1
CS 4233 & 5263 Bioinformatics
Lectures 3-6: Pair-wise Sequence Alignment
2
Outline
Part I: Algorithms
  Biological problem
  Intro to dynamic programming
  Global sequence alignment
  Local sequence alignment
  More efficient algorithms
Part II: Biological issues
  Model gaps more accurately
  Alignment statistics
Part III: BLAST
3
Evolution at the DNA level
…ACGGTGCAGTCACCA…
…ACGTTGC-GTCCACCA…
DNA evolutionary events (sequence edits): mutation, deletion, insertion
4
Sequence conservation implies function
[Figure: sequence changes passed on to the next generation; some are labelled OK, some are rejected (X), and one is labelled "Still OK?"]
5
Why sequence alignment?
Conserved regions are more likely to be functional
  Can be used for finding genes, regulatory elements, etc.
Similar sequences often have similar origin and function
  Can be used to predict functions for new genes / proteins
Sequence alignment is one of the most widely used computational tools in biology
6
Global Sequence Alignment
S  = AGGCTATCACCTGACCTCCAGGCCGATGCCC
T  = TAGCTATCACGACCGCGGTCGATTTGCCCGAC

S’ = -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T’ = TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: an alignment of two strings S, T is a pair of strings S’, T’ (with spaces) such that
  |S’| = |T’| (where |S| denotes the length of S), and
  removing all spaces in S’, T’ leaves S, T
7
What is a good alignment?
The “best” way to match the letters of one sequence with those of the other How do we define “best”?
8
Scoring an alignment
S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
The score of aligning (characters or spaces) x and y is σ(x, y).
Score of an alignment:
\[ \sum_{i=1}^{|S'|} \sigma(S'[i], T'[i]) \]
An optimal alignment: one with the maximum score
9
Scoring Function
Sequence edits: AGGCCTC
  Mutations:  AGGACTC
  Insertions: AGGGCCTC
  Deletions:  AGG-CTC
Scoring function:
  Match: +m
  Mismatch: -s
  Gap (indel): -d
10
Match = 2, mismatch = -1, gap = -1
Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
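A minimal Python sketch of how such a score could be computed from an already-gapped pair of strings. The function name `alignment_score` and the example strings are illustrative assumptions, not taken from the slides; the default parameters follow the scheme above (match = 2, mismatch = -1, gap = -1).

```python
def alignment_score(s_aln, t_aln, match=2, mismatch=-1, gap=-1):
    """Sum the column scores of two equal-length gapped strings ('-' = space)."""
    if len(s_aln) != len(t_aln):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(s_aln, t_aln):
        if a == "-" or b == "-":
            total += gap        # a character aligned to a space
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


# Hypothetical alignment with 3 matches, 2 mismatches and 1 gap.
print(alignment_score("ACGTT-", "ACGAAG"))  # 3*2 - 2*1 - 1*1 = 3
```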
11
More complex scoring function
Substitution matrix
  The similarity score of matching two letters a, b should reflect the probability that a and b are derived from the same ancestor
  It is usually defined by a log likelihood ratio
  An active research area, especially for proteins
  Commonly used: PAM, BLOSUM
12
An example substitution matrix
[Table: example nucleotide substitution matrix over A, C, G, T; the surviving entries are 3, -2 and -1]
13
How to find an optimal alignment?
A naïve algorithm:
  for all subsequences A of S, B of T such that |A| = |B| do
    align A[i] with B[i], 1 ≤ i ≤ |A|
    align all other chars to spaces
    compute its value
    retain the max
  end
  output the retained alignment
Example: S = abcd, A = cd; T = wxyz, B = xz
  -abc-d    a-bc-d
  w--xyz    -w-xyz
14
Analysis
Assume |S| = |T| = n
Cost of evaluating one alignment: ≥ n
How many alignments are there?
  pick n chars of S, T together
  say k of them are in S
  match these k to the k unpicked chars of T
Total time: \( \ge n \binom{2n}{n} > 2^{2n} \) for n > 3
E.g., for n = 20, time is > 2^40 > 10^12 operations
15
Intro to Dynamic Programming
16
Dynamic programming What is dynamic programming? Two simple examples:
A method for solving problems exhibiting the properties of overlapping subproblems and optimal substructure Key idea: tabulating sub-problem solutions rather than re-computing them repeatedly Two simple examples: Computing Fibonacci numbers Find the special shortest path in a grid
17
Example 1: Fibonacci numbers
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …
F(0) = 1; F(1) = 1;
F(n) = F(n-1) + F(n-2)
How to compute F(n)?
18
A recursive algorithm
function fib(n)
  if (n == 0 or n == 1) return 1;
  else return fib(n-1) + fib(n-2);
[Figure: recursion tree for F(9), which calls F(8) and F(7), which call F(7), F(6), F(6), F(5), and so on; the same subproblems appear many times]
19
Why is the recursive Fib algorithm inefficient?
Time complexity: between 2^(n/2) and 2^n
  O(1.62^n), i.e. exponential
Reason: overlapping subproblems
20
An iterative algorithm
function fib(n)
  F[0] = 1; F[1] = 1;
  for i = 2 to n
    F[i] = F[i-1] + F[i-2];
  return F[n];
Time complexity: time O(n), space O(n)
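The contrast between the two versions can be sketched in Python as follows. The function names and the call counter are illustrative assumptions; the convention F(0) = F(1) = 1 follows the slides.

```python
def fib_recursive(n, counter):
    """Plain recursion; counter[0] records how many calls are made."""
    counter[0] += 1
    if n == 0 or n == 1:
        return 1
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)


def fib_table(n):
    """Tabulated version: each F[i] is computed exactly once."""
    F = [1] * (n + 1)          # F[0] = F[1] = 1
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return F[n]


calls = [0]
assert fib_recursive(20, calls) == fib_table(20)
print(fib_table(10))           # 89, the 11th number in the sequence above
print(calls[0])                # number of recursive calls needed for n = 20
```

The recursive version makes a number of calls that grows exponentially with n, while the table version does n - 1 additions.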
21
Example 2: shortest path in a grid
Each edge has a length (cost). We need to get to G from S. Can only move right or down. Aim: find a path with the minimum total length
22
Optimal substructures
Naïve algorithm: enumerate all possible paths and compare costs
  Exponential number of paths
Key observation:
  If a path P(S, G) is the shortest from S to G, then any of its sub-paths P(S, x), where x is on P(S, G), is the shortest from S to x
23
Proof
Proof by contradiction
  Suppose P(S, x) is not the shortest path from S to x, i.e., there is a path P’(S, x) with P’(S, x) < P(S, x)
  Construct a new path P’(S, G) = P’(S, x) + P(x, G)
  Then P’(S, G) < P(S, G), so P(S, G) is not the shortest
  Contradiction
  Therefore, P(S, x) is the shortest
24
Recursive solution
Index each intersection by two indices, (i, j), with (0, 0) at the upper-left corner and (m, n) at the goal
Let F(i, j) be the total length of the shortest path from (0, 0) to (i, j). Therefore, F(m, n) is the length of the shortest path we want.
To compute F(m, n), we need to compute both F(m-1, n) and F(m, n-1):
\[ F(m, n) = \min \begin{cases} F(m-1, n) + \mathrm{length}((m-1, n), (m, n)) \\ F(m, n-1) + \mathrm{length}((m, n-1), (m, n)) \end{cases} \]
25
Recursive solution
\[ F(i, j) = \min \begin{cases} F(i-1, j) + \mathrm{length}((i-1, j), (i, j)) \\ F(i, j-1) + \mathrm{length}((i, j-1), (i, j)) \end{cases} \]
But: if we use recursive calls, many subpaths will be recomputed many times
Strategy: pre-compute F values starting from the upper-left corner. Fill in row by row (what other order will also do?)
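A small Python sketch of the row-by-row fill and the path recovery. The edge representation is an assumption made for this example: `right[i][j]` is the length of the edge from (i, j) to (i, j+1) and `down[i][j]` is the length of the edge from (i, j) to (i+1, j); the tiny grid in the usage lines is made up.

```python
def grid_shortest_path(right, down):
    """Shortest S-to-G path length when only right/down moves are allowed."""
    rows = len(right)              # number of rows of intersections
    cols = len(right[0]) + 1       # number of columns of intersections
    INF = float("inf")
    F = [[INF] * cols for _ in range(rows)]
    F[0][0] = 0
    for i in range(rows):          # fill row by row from the upper-left corner
        for j in range(cols):
            if i > 0:
                F[i][j] = min(F[i][j], F[i - 1][j] + down[i - 1][j])
            if j > 0:
                F[i][j] = min(F[i][j], F[i][j - 1] + right[i][j - 1])
    # Trace back from G to S by re-checking which neighbour gave the minimum.
    i, j = rows - 1, cols - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        if i > 0 and F[i][j] == F[i - 1][j] + down[i - 1][j]:
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    path.reverse()
    return F[rows - 1][cols - 1], path


# Made-up grid with 2 x 3 intersections.
right = [[3, 1], [1, 1]]
down = [[1, 5, 2]]
print(grid_shortest_path(right, down))  # (3, [(0, 0), (1, 0), (1, 1), (1, 2)])
```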
26
Dynamic programming illustration
[Figure: grid from S to G with a length on every edge and the value F(i, j) filled in at every intersection]
\[ F(i, j) = \min \begin{cases} F(i-1, j) + \mathrm{length}((i-1, j), (i, j)) \\ F(i, j-1) + \mathrm{length}((i, j-1), (i, j)) \end{cases} \]
27
Traceback
[Figure: the same grid with its F values; the shortest path is recovered by tracing back from G to S]
28
Elements of dynamic programming
Optimal sub-structures
  Optimal solutions to the original problem contain optimal solutions to sub-problems
Overlapping sub-problems
  Some sub-problems appear in many solutions
Memoization and reuse
  Carefully choose the order in which sub-problems are solved
29
Dynamic Programming for sequence alignment
Suppose we wish to align
  x1……xM
  y1……yN
Let F(i,j) = optimal score of aligning
  x1……xi
  y1……yj
Scoring function:
  Match: +m
  Mismatch: -s
  Gap (indel): -d
31
Optimal substructure
[Figure: x[1..M] drawn above y[1..N], with position i of x aligned to position j of y]
If x[i] is aligned to y[j] in the optimal alignment between x[1..M] and y[1..N], then
  the alignment between x[1..i] and y[1..j] is also optimal
Easy to prove by contradiction
32
Recursive solution
Notice three possible cases:
1. xM aligns to yN
     ~~~~~~~ xM
     ~~~~~~~ yN
2. xM aligns to a gap
     ~~~~~~~ xM
     ~~~~~~~ -
3. yN aligns to a gap
     ~~~~~~~ -
     ~~~~~~~ yN
\[ F(M, N) = \max \begin{cases} F(M-1, N-1) + m & \text{if } x_M = y_N \\ F(M-1, N-1) - s & \text{if } x_M \ne y_N \\ F(M-1, N) - d \\ F(M, N-1) - d \end{cases} \]
33
Recursive solution
Generalize:
\[ F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases} \]
σ(xi, yj) = m if xi = yj, and -s otherwise
Boundary conditions:
  F(0, 0) = 0
  F(0, j) = -jd: y[1..j] aligned to gaps
  F(i, 0) = -id: x[1..i] aligned to gaps
34
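What order to fill?
[Figure: DP table from F(0,0) at one corner to F(M,N) at the opposite corner; cell F(i, j) depends on F(i-1, j-1), F(i-1, j) and F(i, j-1)]
Any order that computes those three neighbours before F(i, j) works, e.g. row by row or column by column.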
36
Example
x = AGTA, y = ATA
m = 1, s = 1, d = 1
\[ F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases} \]
Completed table F(i,j), with i running along x and j along y:

           i = 0    1    2    3    4
                    A    G    T    A
  j = 0        0   -1   -2   -3   -4
  j = 1  A    -1    1    0   -1   -2
  j = 2  T    -2    0    0    1    0
  j = 3  A    -3   -1   -1    0    2

Optimal alignment score: F(4,3) = 2
This only tells us the best score
42
Trace-back
x = AGTA, y = ATA
m = 1, s = 1, d = 1
Start at F(4,3) = 2 and repeatedly step to the neighbouring cell that produced the current value:
  F(4,3) came from F(3,2) (diagonal): A aligned to A
  F(3,2) came from F(2,1) (diagonal): T aligned to T
  F(2,1) came from F(1,1): G aligned to a gap
  F(1,1) came from F(0,0) (diagonal): A aligned to A
Optimal alignment, F(4,3) = 2:
  AGTA
  A-TA
47
Using trace-back pointers
x = AGTA, y = ATA
m = 1, s = 1, d = 1
[Figure sequence: while the table is filled, each cell stores a pointer to the neighbour that gave its maximum; following the pointers from F(4,3) back to F(0,0) recovers the alignment]
Optimal alignment, F(4,3) = 2:
  AGTA
  A-TA
55
The Needleman-Wunsch Algorithm
Initialization.
  F(0, 0) = 0
  F(0, j) = -j d
  F(i, 0) = -i d
Main iteration. Filling in scores
  For each i = 1……M
    For each j = 1……N
      F(i, j) = max of
        F(i-1, j) - d             [case 1]
        F(i, j-1) - d             [case 2]
        F(i-1, j-1) + σ(xi, yj)   [case 3]
      Ptr(i, j) =
        UP,   if [case 1]
        LEFT, if [case 2]
        DIAG, if [case 3]
Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment
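The steps above can be sketched in Python roughly as follows. This is an illustrative sketch, not a reference implementation: the function name, the pointer strings and the tie-breaking order (DIAG before UP before LEFT) are assumptions; m, s and d are the match, mismatch and gap parameters used in the slides.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment; returns (score, aligned x, aligned y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):              # x[1..i] aligned to gaps
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):              # y[1..j] aligned to gaps
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sigma, "DIAG"),
                (F[i - 1][j] - d, "UP"),
                (F[i][j - 1] - d, "LEFT"),
            ]
            # max() keeps the first candidate on ties
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Follow the pointers from (M, N) back to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```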
56
Complexity
Time: O(NM)
Space: O(NM)
  Linear-space algorithms do exist (with the same time complexity)
57
Equivalent graph problem
S1 = A G T A, S2 = A T A
[Figure: grid graph from (0,0) to (3,4). Edges along one axis are gaps in the 2nd sequence, edges along the other axis are gaps in the 1st sequence, and diagonal edges are matches / mismatches]
Value on vertical/horizontal line: -d
Value on diagonal: m or -s
Number of steps: length of the alignment
Path length: alignment score
Optimal alignment: find the longest path from (0, 0) to (3, 4)
The general longest-path problem cannot be solved with DP. The longest path on this graph can be found by DP since no cycle is possible.
58
Question If we change the scoring scheme, will the optimal alignment be changed? Old: Match = 1, mismatch = gap = -1 New: match = 2, mismatch = gap = 0 New: Match = 2, mismatch = gap = -2?
59
Question What kind of alignment is represented by these paths? A- BC
Alternating gaps are impossible if –s > -2d
60
A variant of the basic algorithm
Scoring scheme: m = s = d = 1

Seq1: CAGCA-CTTGGATTCTCGG
         || |:|||
Seq2: ---CAGCGTGG
Score = -7

Seq1: CAGCACTTGGATTCTCGG
      ||||     | |    ||
Seq2: CAGC-----G-T----GG
Score = -2

The first alignment may be biologically more realistic in some cases (e.g. if we know s2 is a subsequence of s1)
61
A variant of the basic algorithm
Maybe it is OK to have an unlimited # of gaps in the beginning and end:

          CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG

Then, we don’t want to penalize gaps at the ends
62
The Overlap Detection variant
Changes:
Initialization
  For all i, j: F(i, 0) = 0, F(0, j) = 0
Termination
\[ F_{OPT} = \max \begin{cases} \max_i F(i, N) \\ \max_j F(M, j) \end{cases} \]
[Figure: x1……xM and y1……yN overlapping at their ends]
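A short Python sketch of these two changes relative to the global algorithm (zero boundary, maximum over the last row and last column). The function name and the test strings are assumptions for illustration; only the score is computed.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best overlap score: gaps before and after the overlap are free."""
    M, N = len(x), len(y)
    # F(i, 0) = F(0, j) = 0: leading gaps are not penalized
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Trailing gaps are free too: take the best cell where x or y is used up.
    best_last_col = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_col, best_last_row)


# The suffix CCC of the first string overlaps the prefix CCC of the second.
print(overlap_score("AAACCC", "CCCGGG"))  # 3
```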
63
Different types of overlaps
[Figure: the different ways in which x and y can overlap]
64
The local alignment problem
Given two strings X = x1……xM, Y = y1……yN
Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum
e.g. X = abcxdex    X’ = cxde
     Y = xxxcde     Y’ = c-de
65
Why local alignment Conserved regions may be a small part of the whole
Global alignment might miss them if flanking “junk” outweighs similar regions Genes are shuffled between genomes A B C D B D A C
66
Naïve algorithm
for all substrings X’ of X and Y’ of Y
  Align X’ & Y’ via dynamic programming
  Retain pair with max value
end
Output the retained pair
Time: O(n^2) choices for X’, O(m^2) for Y’, O(nm) for DP, so O(n^3 m^3) total.
67
Reminder The overlap detection algorithm
We do not give penalty to gaps at either end Free gap Free gap
68
The local alignment idea
Do not penalize the unaligned regions (gaps or mismatches) The alignment can start anywhere and ends anywhere Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment By resetting alignment score to zero
69
The Smith-Waterman algorithm
Initialization: F(0, j) = F(i, 0) = 0
Iteration:
\[ F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + \sigma(x_i, y_j) \end{cases} \]
70
The Smith-Waterman algorithm
Termination:
If we want the best local alignment:
  F_OPT = max_{i,j} F(i, j)
If we want all local alignments scoring > t:
  for all i, j with F(i, j) > t, trace back
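A compact Python sketch of the local alignment recurrence with the reset to zero, returning only the single best local alignment. The function name and the way ties are broken during trace-back are assumptions; the defaults (match 2, mismatch -1, gap -1) follow the example slides.

```python
def smith_waterman(x, y, m=2, s=1, d=1):
    """Best local alignment; returns (score, aligned x', aligned y')."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # F(0, j) = F(i, 0) = 0
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(0,
                          F[i - 1][j - 1] + sigma,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        sigma = m if x[i - 1] == y[j - 1] else -s
        if F[i][j] == F[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = smith_waterman("abcxdex", "xxxcde")
print(score)   # 5
print(ax)
print(ay)
```

Several local alignments can share the best score; which one is reported depends on the tie-breaking choices in this sketch.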
71
Example
X = abcxdex, Y = xxxcde
Match: 2, Mismatch: -1, Gap: -1
[Figure sequence: the local alignment table is filled cell by cell; negative values are reset to 0 and the largest entry is 5]
Trace back
Tracing back from the cells with score 5 gives two optimal local alignments:
  cxde      x-de
  | ||      | ||
  c-de      xcde
80
No negative values in local alignment DP array
Optimal local alignment will never have a gap on either end Local alignment: “Smith-Waterman” Global alignment: “Needleman-Wunsch”
81
Analysis
Time:
  O(MN) for finding the best alignment
  Time to report all alignments depends on the number of sub-optimal alignments
Memory:
  O(MN)
  O(M+N) possible
82
More efficient alignment algorithms
83
Given two sequences of length M, N
Time: O(MN)
  OK, but still slow for long sequences
Space: O(MN)
  Bad: 1 Mb seq x 1 Mb seq = 1 TB memory
Can we do better?
84
Bounded alignment Good alignment should appear near the diagonal
85
Bounded Dynamic Programming
If we know that x and y are very similar
Assumption: # gaps(x, y) < k
Then, xi aligned to yj implies | i - j | < k
86
Bounded Dynamic Programming
Initialization:
  F(i,0), F(0,j) undefined for i, j > k
Iteration:
  For i = 1…M
    For j = max(1, i - k)…min(N, i + k)
\[ F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i, j-1) - d & \text{if } j > i - k \\ F(i-1, j) - d & \text{if } j < i + k \end{cases} \]
Termination: same
[Figure: band of half-width k around the diagonal of the table for x1…xM and y1…yN]
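A Python sketch of the banded fill, under the assumption that cells outside the band are simply treated as minus infinity, which makes the two side conditions in the recurrence automatic. For readability the sketch still allocates the full table, so it does not show the O(kM) space saving; the function name is hypothetical.

```python
def banded_global_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to cells with |i - j| <= k."""
    M, N = len(x), len(y)
    NEG = float("-inf")                      # marks cells outside the band
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for i in range(1, min(M, k) + 1):        # boundary cells inside the band
        F[i][0] = -i * d
    for j in range(1, min(N, k) + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,
                          F[i - 1][j] - d,   # -inf if (i-1, j) is outside
                          F[i][j - 1] - d)   # -inf if (i, j-1) is outside
    return F[M][N]                           # -inf if (M, N) is outside the band


print(banded_global_score("AGTA", "ATA", k=1))  # 2, same as the unbanded score
```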
87
Analysis
Time: O(kM) << O(MN)
Space: O(kM) with some tricks
[Figure: the band of width 2k around the diagonal of the M-row table is stored as an M x 2k array]
90
Linear space algorithm
If all we need is the alignment score but not the alignment, easy!
  We only need to keep two rows
  (You only need one row, with a little trick)
But how do we get the alignment?
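The two-row idea can be sketched in Python as follows; the function name is an assumption and the scoring parameters m, s, d are those of the global alignment slides.

```python
def global_score_two_rows(x, y, m=1, s=1, d=1):
    """Global alignment score using only two rows of the DP table."""
    prev = [-j * d for j in range(len(y) + 1)]      # row i = 0
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)              # F(i, 0) = -i d
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma,      # diagonal
                          prev[j] - d,              # from row i - 1
                          curr[j - 1] - d)          # from the same row
        prev = curr                                 # row i - 1 is no longer needed
    return prev[-1]


print(global_score_two_rows("AGTA", "ATA"))  # 2
```

Only the score survives; the rows needed for a trace-back have been discarded.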
91
Linear space algorithm
When we finish, we know how we have aligned the ends of the sequences (xM, yN)
Naïve idea: repeat on the smaller subproblem F(M-1, N-1)
  Time complexity: O((M+N)(MN))
92
(0, 0) M/2 (M, N) Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.
93
[Figure: 6 x 6 grid from (0,0) to (6,6), with candidate midpoints (3,0), (3,2), (3,4), (3,6) marked on row 3]
Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k))
94
Hirschberg’s idea
Divide and conquer!
Forward algorithm
  Align x1x2…xM/2 with Y
  F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk
95
Backward algorithm
  Align reverse(xM/2+1…xM) with reverse(Y)
  B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(yk+1 yk+2…yN)
96
Linear-space alignment
Using 2 (4) rows of space, we can compute, for k = 1…N, F(M/2, k) and B(M/2, k)
97
Linear-space alignment
Now, we can find k* maximizing F(M/2, k) + B(M/2, k)
Also, we can trace the path exiting row M/2 from k*
Conclusion: in O(NM) time and O(N) space, we have found where the optimal alignment path crosses row M/2
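A Python sketch of one divide step: a forward pass over the first half of x, a backward pass over the reversed second half, and the choice of k*. The helper names are assumptions, and the full recursive algorithm (which repeats this on both halves) is not shown.

```python
def last_row(x, y, m=1, s=1, d=1):
    """Last row of the global DP table for x vs y, using two rows of space."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma, prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev


def hirschberg_split(x, y, m=1, s=1, d=1):
    """Return (M/2, k*, score): where an optimal path crosses row M/2."""
    mid, N = len(x) // 2, len(y)
    fwd = last_row(x[:mid], y, m, s, d)                  # F(M/2, k)
    bwd = last_row(x[mid:][::-1], y[::-1], m, s, d)      # indexed by N - k
    totals = [fwd[k] + bwd[N - k] for k in range(N + 1)]
    k_star = max(range(N + 1), key=lambda k: totals[k])
    return mid, k_star, totals[k_star]


print(hirschberg_split("AGTA", "ATA"))  # (2, 1, 2)
```

For x = AGTA and y = ATA the split says that AG is aligned with A and TA with TA, and the combined score equals the optimal global score of 2.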
98
Linear-space alignment
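Iterate this procedure on the two sub-problems!
[Figure: the table split at (M/2, k*) into an upper-left block of size M/2 x k* and a lower-right block of size M/2 x (N - k*)]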
99
Analysis
Memory: O(N) for computation, O(N+M) to store the optimal alignment
Time:
  MN for the first iteration
  k M/2 + (N-k) M/2 = MN/2 for the second
  …
100
MN + MN/2 + MN/4 + MN/8 + … = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) = 2MN = O(MN)
101
Outline
Part I: Algorithms
  Biological problem
  Intro to dynamic programming
  Global sequence alignment
  Local sequence alignment
  More efficient algorithms
Part II: Biological issues
  Model gaps more accurately
  Alignment statistics
Part III: BLAST
102
How to model gaps more accurately
103
What’s a better alignment?

GACGCCGAACG
|||||   |||
GACGC---ACG
Score = 8 x m - 3 x d

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score = 8 x m - 3 x d

However, gaps usually occur in bunches.
  During evolution, chunks of DNA may be lost entirely
  Aligning genomic sequences vs. cDNAs (reverse complementary to mRNAs)
104
Model gaps more accurately
Current model:
  Gap of length n incurs penalty n x d
General:
  Convex function, e.g. γ(n) = c * sqrt(n)
[Figure: gap penalty γ(n) plotted against gap length n for the two models]
105
General gap dynamic programming
Initialization: same
Iteration:
\[ F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ \max_{k=0 \ldots i-1} F(k, j) - \gamma(i-k) \\ \max_{k=0 \ldots j-1} F(i, k) - \gamma(j-k) \end{cases} \]
Termination: same
Running time: O((M+N)MN) (cubic)
Space: O(NM) (linear-space algorithm not applicable)
106
Compromise: affine gaps
γ(n) = d + (n - 1)e
  d: gap open
  e: gap extension
Example: Match: 2, Gap open: -5, Gap extension: -1

GACGCCGAACG
|||||   |||
GACGC---ACG
8 x 2 - 5 - 2 = 9

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
8 x 2 - 3 x 5 = 1

We want to find the optimal alignment with affine gap penalty in
  O(MN) time
  O(MN) or better O(M+N) memory
107
Allowing affine gap penalties
Still three cases
  xi aligned with yj
  xi aligned to a gap
    Are we continuing a gap in x? (if no, start is more expensive)
  yj aligned to a gap
    Are we continuing a gap in y? (if no, start is more expensive)
We can use a finite state machine to represent the three cases as three states
  The machine has two heads, reading the chars on the two strings separately
  At every step, each head reads 0 or 1 char from each sequence
  Depending on what it reads, it goes to a different state and produces different scores
108
Finite State Machine
[Figure: three states F, Ix, Iy; every transition carries a label of the form Input / Output, still to be filled in]
States:
  F: have just read 1 char from each seq (xi aligned to yj)
  Ix: have read 0 char from x (yj aligned to a gap)
  Iy: have read 0 char from y (xi aligned to a gap)
109
[Figure: the FSM with its transitions labelled Input / Output; start state F]

  Current state   Input      Output       Next state
  F               (xi, yj)   σ(xi, yj)    F
  F               (-, yj)    d            Ix
  F               (xi, -)    d            Iy
  Ix              (xi, yj)   σ(xi, yj)    F
  Ix              (-, yj)    e            Ix
  Iy              (xi, yj)   σ(xi, yj)    F
  Iy              (xi, -)    e            Iy
110
[Figure: the FSM with states F, Ix, Iy; start state F]
Example state paths for the sequences AAC and ACT:

  F-F-F-F        F-Iy-F-F-Ix      F-F-Iy-F-Ix
  AAC            AAC-             AAC-
  |||             ||              | |
  ACT            -ACT             A-CT

Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM.
Optimal alignment: find a state path to read the two sequences such that the total output score is the highest
111
Dynamic programming
We encode this information in three different matrices
For each element (i,j) we use three variables
  F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj
  Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to a gap
  Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to a gap
112
[Figure: the FSM, highlighting the three transitions (xi, yj) / σ(xi, yj) that enter state F]
\[ F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ I_x(i-1, j-1) + \sigma(x_i, y_j) \\ I_y(i-1, j-1) + \sigma(x_i, y_j) \end{cases} \]
113
[Figure: the FSM, highlighting the transitions (-, yj) / d and (-, yj) / e that enter state Ix]
\[ I_x(i, j) = \max \begin{cases} F(i, j-1) + d \\ I_x(i, j-1) + e \end{cases} \]
114
[Figure: the FSM, highlighting the transitions (xi, -) / d and (xi, -) / e that enter state Iy]
\[ I_y(i, j) = \max \begin{cases} F(i-1, j) + d \\ I_y(i-1, j) + e \end{cases} \]
115
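\[ F(i, j) = \sigma(x_i, y_j) + \max \begin{cases} F(i-1, j-1) & \text{continuing alignment} \\ I_x(i-1, j-1) & \text{closing gaps in x} \\ I_y(i-1, j-1) & \text{closing gaps in y} \end{cases} \]
\[ I_x(i, j) = \max \begin{cases} F(i, j-1) + d & \text{opening a gap in x} \\ I_x(i, j-1) + e & \text{gap extension in x} \end{cases} \]
\[ I_y(i, j) = \max \begin{cases} F(i-1, j) + d & \text{opening a gap in y} \\ I_y(i-1, j) + e & \text{gap extension in y} \end{cases} \]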
116
Data dependency
[Figure: F(i, j) reads cell (i-1, j-1) of F, Ix and Iy; Ix(i, j) reads cell (i, j-1) of F and Ix; Iy(i, j) reads cell (i-1, j) of F and Iy]
118
Data dependency If we stack all three matrices No cyclic dependency
Therefore, we can fill in all three matrices in order
119
Algorithm
for i = 1:M
  for j = 1:N
    Fill in F(i, j), Ix(i, j), Iy(i, j)
  end
end
F(M, N) = max (F(M, N), Ix(M, N), Iy(M, N))
Time: O(MN)
Space: O(MN), or O(N) when combined with the linear-space algorithm
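A Python sketch of the three-matrix fill, computing the score only. It follows the sign convention of this part of the slides, where s, d and e are negative numbers that are added; the function name and the boundary values Ix(0, j) = d + (j-1)e and Iy(i, 0) = d + (i-1)e are assumptions consistent with γ(n) = d + (n - 1)e.

```python
def affine_gap_score(x, y, m=2, s=-2, d=-5, e=-1):
    """Global alignment score with gap open d and gap extension e (both negative)."""
    NEG = float("-inf")
    M, N = len(x), len(y)
    F = [[NEG] * (N + 1) for _ in range(M + 1)]    # xi aligned to yj
    Ix = [[NEG] * (N + 1) for _ in range(M + 1)]   # yj aligned to a gap
    Iy = [[NEG] * (N + 1) for _ in range(M + 1)]   # xi aligned to a gap
    F[0][0] = 0
    for j in range(1, N + 1):                      # y[1..j] aligned to gaps
        Ix[0][j] = d + (j - 1) * e
    for i in range(1, M + 1):                      # x[1..i] aligned to gaps
        Iy[i][0] = d + (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else s
            F[i][j] = sigma + max(F[i - 1][j - 1],
                                  Ix[i - 1][j - 1],
                                  Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)
    return max(F[M][N], Ix[M][N], Iy[M][N])


# Parameters of the exercise that follows: m = 2, s = -2, d = -5, e = -1.
print(affine_gap_score("GCAC", "GCC"))  # 1
```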
120
Exercise x = GCAC y = GCC m = 2 s = -2 d = -5 e = -1
121
Solution
x = GCAC, y = GCC
m = 2, s = -2, d = -5, e = -1
[Figure sequence: the three tables F (aligned on both), Ix (insertion on x) and Iy (insertion on y) are filled cell by cell using the recurrences for F(i, j), Ix(i, j) and Iy(i, j)]
Boundary values shown in the tables:
  Ix(0, j) = -5, -6, -7 for j = 1, 2, 3
  Iy(i, 0) = -5, -6, -7, -8 for i = 1, 2, 3, 4
Final score: F(4, 3) = 1
Optimal alignment:
  x: GCAC
     || |
  y: GC-C
140
Statistics of alignment
Where does σ(xi, yj) come from?
Are two aligned sequences actually related?
141
Probabilistic model of alignments
We’ll first focus on protein alignments without gaps
Given an alignment, we can consider two possible models
  R: the sequences are related by evolution
  U: the sequences are unrelated
How can we distinguish these two models?
How is this view related to the amino-acid substitution matrix?
142
Model for unrelated sequences
Assume each position of the alignment is independently sampled from some distribution of amino acids
ps: probability of amino acid s in the sequences
Probability of seeing an amino acid s aligned to an amino acid t by chance is
  Pr(s, t | U) = ps * pt
Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn randomly is
\[ \Pr(x, y \mid U) = \prod_{i=1}^{n} p_{x_i} p_{y_i} \]
143
Model for related sequences
Assume each pair of aligned amino acids evolved from a common ancestor
Let qst be the probability that amino acid s in one sequence is related to t in another sequence
The probability of an alignment of x and y is given by
\[ \Pr(x, y \mid R) = \prod_{i=1}^{n} q_{x_i y_i} \]
144
Probabilistic model of Alignments
How can we decide which model (U or R) is more likely?
One principled way is to consider the relative likelihood of the two models (the odds ratio)
\[ \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \prod_{i} \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}} \]
A higher ratio means that R is more likely than U
145
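Log odds ratio
Taking logarithm, we get
\[ \log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_{i} \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}} \]
Recall that the score of an alignment is given by
\[ S = \sum_{i} \sigma(x_i, y_i) \]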
146
Therefore, if we define
\[ \sigma(s, t) = \log \frac{q_{st}}{p_s p_t} \]
we are actually defining the alignment score as the log odds ratio between the two models R and U
147
How to get the probabilities?
ps can be counted from the available protein sequences But how do we get qst? (the probability that s and t have a common ancestor) Counted from trusted alignments of related sequences
148
Protein Substitution Matrices
Two popular sets of matrices for protein sequences
  PAM matrices [Dayhoff et al, 1978]
    Better for aligning closely related sequences
  BLOSUM matrices [Henikoff & Henikoff, 1992]
    For both closely and remotely related sequences
149
BLOSUM-N matrices Constructed from a database called BLOCKS
The BLOCKS database contains many closely related sequences
  Conserved amino acids may be over-counted
N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity
  identity: % of matched columns
Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)
150
\[ \sigma(s, t) = \frac{1}{\lambda} \ln \frac{q_{st}}{p_s p_t} \]
λ: scaling factor to convert the score to an integer.
Important: when you are told that a scoring matrix is in half-bits => λ = ½ ln 2
Positive for chemically similar substitutions
Common amino acids get low weights
Rare amino acids get high weights
151
BLOSUM-N matrices
If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with higher N, say BLOSUM75
On the other hand, if you want to detect remote homology, you may want to use lower N, say BLOSUM50
BLOSUM-62: good for most purposes
[Figure: scale of BLOSUM matrices from weak homology (low N) to strong homology (high N)]
152
For DNAs No database of trusted alignments to start with
Specify the percentage identity you would like to detect You can then get the substitution matrix by some calculation
153
For example
Suppose pA = pC = pT = pG = 0.25
We want 88% identity
qAA = qCC = qTT = qGG = 0.22, the rest = 0.12/12 = 0.01
σ(A, A) = σ(C, C) = σ(G, G) = σ(T, T) = log (0.22 / (0.25*0.25)) = 1.26
σ(s, t) = log (0.01 / (0.25*0.25)) = -1.83 for s ≠ t
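The calculation can be sketched in Python as follows, assuming a uniform background (each nucleotide has probability 0.25), the natural logarithm, and mismatch probability spread evenly over the 12 off-diagonal pairs. The function name is hypothetical.

```python
import math


def dna_log_odds(identity, p=0.25):
    """Match / mismatch log-odds scores for a target fraction of identical columns."""
    q_match = identity / 4             # qAA = qCC = qGG = qTT
    q_mismatch = (1 - identity) / 12   # the remaining 12 ordered pairs
    match = math.log(q_match / (p * p))
    mismatch = math.log(q_mismatch / (p * p))
    return match, mismatch


match, mismatch = dna_log_odds(0.88)
print(round(match, 2), round(mismatch, 2))      # 1.26 -1.83
print(round(4 * match), round(4 * mismatch))    # 5 -7 after scaling by 4
```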
154
Substitution matrix

        A      C      G      T
  A   1.26  -1.83  -1.83  -1.83
  C  -1.83   1.26  -1.83  -1.83
  G  -1.83  -1.83   1.26  -1.83
  T  -1.83  -1.83  -1.83   1.26
155
      A   C   G   T
  A   5  -7  -7  -7
  C  -7   5  -7  -7
  G  -7  -7   5  -7
  T  -7  -7  -7   5

Scale won’t change the alignment
Multiply by 4 and then round off to get integers
156
Arbitrary substitution matrix
Say you have a substitution matrix provided by someone It’s important to know what you are actually looking for when you use the matrix
157
Which one should I use for my sequences?

NCBI-BLAST                 WU-BLAST
      A   C   G   T              A   C   G   T
  A   1  -2  -2  -2          A   5  -4  -4  -4
  C  -2   1  -2  -2          C  -4   5  -4  -4
  G  -2  -2   1  -2          G  -4  -4   5  -4
  T  -2  -2  -2   1          T  -4  -4  -4   5

What’s the difference?
158
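We had
\[ \sigma(s, t) = \log \frac{q_{st}}{p_s p_t} \]
Scale it, so that
\[ \lambda \, \sigma(s, t) = \ln \frac{q_{st}}{p_s p_t} \]
Reorganize:
\[ q_{st} = p_s p_t \, e^{\lambda \sigma(s, t)} \]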
159
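Since all probabilities must sum to 1,
\[ \sum_{s,t} q_{st} = 1 \]
We have
\[ \sum_{s,t} p_s p_t \, e^{\lambda \sigma(s, t)} = 1 \]
Suppose again ps = 0.25 for any s
We know σ(s, t) from the substitution matrix
We can solve the equation for λ
Plug λ into \( q_{st} = p_s p_t e^{\lambda \sigma(s, t)} \) to get qst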
160
NCBI-BLAST (match 1, mismatch -2)
  λ = 1.33
  qst = 0.24 for s = t, and 0.004 for s ≠ t
  Translate: 95% identity
WU-BLAST (match 5, mismatch -4)
  λ = 0.19
  qst = 0.16 for s = t, and 0.03 for s ≠ t
  Translate: 65% identity
161
Details for solving λ (NCBI-BLAST matrix: match 1, mismatch -2)
Known: σ(s,t) = 1 for s = t, and σ(s,t) = -2 for s ≠ t.
Since \( q_{st} = p_s p_t e^{\lambda \sigma(s,t)} \) and \( \sum_{s,t} q_{st} = 1 \), we have
  12 * ¼ * ¼ * e^(-2λ) + 4 * ¼ * ¼ * e^λ = 1
Let e^λ = x; we have ¾ x^(-2) + ¼ x = 1.
Hence, x^3 - 4x^2 + 3 = 0
x has three solutions: 3.8, 1, -0.8
Only the first solution leads to a positive λ = ln (3.8) = 1.33
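Instead of solving the cubic by hand, λ can also be found numerically. The Python sketch below uses plain bisection on the equation above, assuming a uniform background p = 0.25 and a matrix with a single match score and a single mismatch score whose expected value is negative (otherwise no positive root exists). The function names are hypothetical.

```python
import math


def solve_lambda(match, mismatch, p=0.25, tol=1e-9):
    """Positive root of 4 p^2 e^(lambda*match) + 12 p^2 e^(lambda*mismatch) = 1."""
    def f(lam):
        return (4 * p * p * math.exp(lam * match)
                + 12 * p * p * math.exp(lam * mismatch) - 1.0)

    lo, hi = 1e-6, 1.0          # f(lo) < 0 when the expected score is negative
    while f(hi) <= 0:           # grow the bracket until f changes sign
        hi *= 2
    while hi - lo > tol:        # bisection
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def target_identity(match, mismatch, p=0.25):
    """Fraction of identical columns implied by the matrix: 4 * q_match."""
    lam = solve_lambda(match, mismatch, p)
    return 4 * p * p * math.exp(lam * match)


print(round(solve_lambda(1, -2), 2), round(target_identity(1, -2), 2))  # 1.33 0.95
print(round(solve_lambda(5, -4), 2), round(target_identity(5, -4), 2))  # 0.19 0.65
```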
162
Statistics of alignment
Where does $\sigma(x_i, y_j)$ come from? Are two aligned sequences actually related?
163
Statistics of Alignment Scores
Q: How do we assess whether an alignment provides good evidence for homology (i.e., the two sequences are evolutionarily related)? Is a score 82 good? What about 180? A: determine how likely it is that such an alignment score would result from chance
164
P-value of alignment
p-value: the probability that the alignment score can be obtained from aligning random sequences. A small p-value means the score is unlikely to happen by chance. The most common thresholds are 0.01 and 0.05; the choice also depends on the purpose of the comparison and the cost of a false claim. Model-based vs simulation-based.
165
Statistics of global seq alignment
Theory only applies to local alignment For global alignment, your best bet is to do Monte-Carlo simulation What’s the chance you can get a score as high as the real alignment by aligning two random sequences? Procedure Given sequence X, Y Compute a global alignment (score = S) Randomly shuffle sequence X (or Y) N times, obtain X1, X2, …, XN Align each Xi with Y, (score = Ri) P-value: the fraction of Ri >= S
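The shuffling procedure above can be sketched as follows. This is an illustrative implementation, not the one used to produce the lecture's figures: it assumes a linear gap penalty with the match = 2, mismatch = -1, gap = -1 scores from the earlier scoring slide, and the function names and the fixed random seed are choices made for this example.

```python
import random

def global_score(x, y, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch score with a linear gap penalty, keeping one row at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffle_p_value(x, y, n=200, seed=0):
    """Return (S, fraction of n shuffled copies of x whose score against y is >= S)."""
    rng = random.Random(seed)
    s = global_score(x, y)
    letters = list(x)
    hits = 0
    for _ in range(n):
        rng.shuffle(letters)                       # shuffle preserves composition of x
        if global_score("".join(letters), y) >= s:
            hits += 1
    return s, hits / n
```

A returned fraction of 0 only means the p-value is below 1/n; more shuffles are needed to resolve smaller values.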
166
Human HEXA Fly HEXO1 Score = -74
167
Distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences (observed score: -74). There are 88 random sequences with alignment score >= -74. So: p-value = 88 / 200 = 0.44 => alignment is not significant.
168
Mouse HEXA vs. human HEXA
Global alignment of mouse HEXA and human HEXA (alignment figure not reproduced). Score = 732
169
No random sequences with alignment score >= 732
Distribution of the alignment scores between mouse HEXA and 200 randomly shuffled human HEXA sequences (observed score: 732). No random sequences with alignment score >= 732. So: the P-value is less than 1 / 200 = 0.005. To get a smaller p-value, we have to align more random sequences, which is very slow, unless we can fit a distribution (e.g. normal distribution). Such a distribution may not be generalizable. No theory exists for the global alignment score distribution.
170
Statistics for local alignment
Elegant theory exists. The score for ungapped local alignment follows an extreme value distribution (Gumbel distribution). (Figure: normal distribution vs. extreme value distribution.) An example extreme value distribution: randomly sample 100 numbers from a normal distribution and compute the max. Repeat 100 times. The max values will follow an extreme value distribution.
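The sampling experiment described on this slide is easy to try. A small sketch, assuming a standard normal distribution (mean 0, standard deviation 1) and a fixed seed for reproducibility; `sample_maxima` is a name invented here.

```python
import random
import statistics

def sample_maxima(n_draws=100, n_repeats=100, seed=0):
    """Repeat n_repeats times: draw n_draws standard normal values and keep the max."""
    rng = random.Random(seed)
    return [max(rng.gauss(0, 1) for _ in range(n_draws)) for _ in range(n_repeats)]

maxima = sample_maxima()
mean_of_maxima = statistics.mean(maxima)
```

Plotting a histogram of `maxima` should show a right-skewed shape centered well above 0 (roughly around 2.5 for these settings), unlike the symmetric normal distribution the values were drawn from.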
171
Statistics for local alignment
Given two unrelated sequences of lengths M, N, the expected number of ungapped local alignments with score at least S can be calculated by $E(S) = KMN \exp[-\lambda S]$, known as the E-value. $\lambda$: scaling factor as computed in last lecture. K: empirical parameter, ~0.1. Both depend on sequence composition and substitution matrix.
172
P-value for local alignment score
P-value for a local alignment with score S: $P(S) = 1 - \exp[-E(S)] \approx E(S)$ when P is small.
173
Example You are aligning two sequences, each has 1000 bases
m = 1, s = -1, d = -inf (ungapped alignment) You obtain a score 20 Is this score significant?
174
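$\lambda = \ln 3 = 1.1$ (computed as discussed on slide #41)
$E(S) = KMN \exp\{-\lambda S\}$. $E(20) = 0.1 \times 1000 \times 1000 \times 3^{-20} = 3 \times 10^{-5}$. P-value $= 3 \times 10^{-5} \ll 0.05$. The alignment is significant. (Figure: distribution for 1000 random sequence pairs, with the observed score 20 marked.)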
175
Multiple-testing problem
Searching a 1000-base sequence against a database of $10^6$ sequences (each of length 1000). How significant is a score of 20 now? You are essentially comparing 1000 bases with $1000 \times 10^6 = 10^9$ bases (ignoring edge effects). $E(20) = 0.1 \times 1000 \times 10^9 \times 3^{-20} \approx 30$. By chance we would expect to see 30 matches. The P-value (probability of seeing at least one match with score >= 20) is $1 - e^{-30} \approx 1$. The alignment is not significant. Caution: it does NOT mean that the two sequences are unrelated. Rather, it simply means that you have NO confidence to say whether the two sequences are related.
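Both worked examples (pairwise and database search) follow from the same two formulas. A minimal sketch, assuming K = 0.1 and lambda = ln 3 as on the slides; `e_value` and `p_value` are illustrative helper names, not part of any BLAST API.

```python
import math

def e_value(score, m, n, lam, k=0.1):
    """Expected number of ungapped local alignments with score >= score."""
    return k * m * n * math.exp(-lam * score)

def p_value(score, m, n, lam, k=0.1):
    """Probability of at least one such alignment: 1 - exp(-E)."""
    return -math.expm1(-e_value(score, m, n, lam, k))

lam = math.log(3)                       # m = 1, s = -1, ungapped
e_pair = e_value(20, 1000, 1000, lam)   # 1000 bp vs 1000 bp
e_db = e_value(20, 1000, 1e9, lam)      # 1000 bp vs 10^6 sequences of 1000 bp
p_pair = p_value(20, 1000, 1000, lam)
p_db = p_value(20, 1000, 1e9, lam)
```

`e_pair` is about 3e-5 and nearly equal to `p_pair`, while `e_db` is close to 30 and `p_db` is essentially 1, so the same score of 20 is significant in the first setting and not in the second.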
176
Score threshold to determine significance
You want a p-value that is very small (even after taking multiple testing into consideration). What S will guarantee you a significant p-value? $E(S) \approx P(S) \ll 1$ => $KMN \exp[-\lambda S] \ll 1$ => $\log(KMN) - \lambda S < 0$ => $S > T + \log(MN)/\lambda$, where $T = \log(K)/\lambda$ is usually small.
177
Score threshold to determine significance
In the previous example, m = 1, s = -1, d = -inf => $\lambda = 1.1$. Aligning 1000 bp vs 1000 bp: $S > \log(10^6)/1.1 = 13$, so 20 is significant. Searching 1000 bp against $10^6 \times 1000$ bp: $S > \log(10^{12})/1.1 = 25$, so 20 is not significant.
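The threshold rule can be written as a one-line function. In this sketch the default `k=1.0` drops the (usually small) term T = log(K)/lambda, as the slide's arithmetic does; the function name and signature are assumptions for illustration.

```python
import math

def score_threshold(m, n, lam, k=1.0):
    """Score at which E(S) = K*M*N*exp(-lam*S) equals 1; k=1.0 ignores T = log(K)/lam."""
    return (math.log(k) + math.log(m * n)) / lam

pairwise = score_threshold(1000, 1000, 1.1)   # 1000 bp vs 1000 bp
database = score_threshold(1000, 1e9, 1.1)    # 1000 bp vs 10^6 x 1000 bp
```

These evaluate to roughly 13 and 25. Passing `k=0.1` lowers both thresholds by about 2, which is the size of the T term the slide treats as negligible.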
178
Statistics for gapped local alignment
Theory not well developed. Extreme value distribution works well empirically. Need to estimate K and $\lambda$ empirically: given the database and substitution matrix, generate some random sequence pairs, do local alignment, and fit an extreme value distribution to obtain K and $\lambda$.
179
Alignment statistics summary
How to obtain a substitution matrix? Obtain $q_{st}$ and $p_s$ from established alignments (for DNA: from your knowledge). Computing score: $\sigma(s,t) = \log \frac{q_{st}}{p_s p_t}$. How to understand an arbitrary substitution matrix? Solve $\sum_{s,t} p_s p_t\, e^{\lambda \sigma(s,t)} = 1$ to obtain $\lambda$ and the target $q_{st}$, which tells you what percent identity you are expecting. How to understand an alignment score? Use the p-value: the probability that a score can be expected from chance. Global alignment: Monte-Carlo simulation. Local alignment: extreme value distribution; estimate a p-value from a score, or determine a score threshold without computing a p-value.
180
Part III: Heuristic Local Sequence Alignment: BLAST
181
State of biological databases
Sequenced Genomes: Human $10^9$, Yeast $1.2 \times 10^7$, Mouse $2.7 \times 10^9$, Rat $2.6 \times 10^9$, Neurospora $4 \times 10^7$, Fugu fish $3.3 \times 10^8$, Tetraodon $3 \times 10^8$, Mosquito $2.8 \times 10^8$, Drosophila $1.2 \times 10^8$, Worm $1.0 \times 10^8$, Rice $1.0 \times 10^9$, Arabidopsis $1.2 \times 10^8$, sea squirts $10^8$. Current rate of sequencing (before new-generation sequencing): 4 big labs at $3 \times 10^9$ bp/year/lab; 10s of small labs; private sectors. With new-generation sequencing: easily generating billions of reads daily.
182
Some useful applications of alignments
Given a newly discovered gene: does it occur in other species? Assume we try Smith-Waterman, aligning our new gene ($10^4$ bp) against the entire genomic database: may take several weeks!
183
Some useful applications of alignments
Given a newly sequenced organism: which subregions align with other organisms? Potential genes; other functional units. Assume we try Smith-Waterman, aligning our newly sequenced mammal ($3 \times 10^9$ bp) against the entire genomic database: > 1000 years???
184
BLAST Basic Local Alignment Search Tool
Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990. The most widely used bioinformatics tool. Which is better: a long mediocre match, or a few nearby, short, strong matches with the same total score? Score-wise, exactly equivalent. Biologically, the latter may be more interesting, and is common. At least, if we must miss some, we would rather miss the former. BLAST is a heuristic algorithm emphasizing the latter. Speed/sensitivity tradeoff: BLAST may miss the former, but gains greatly in speed.
185
BLAST: Available at NCBI (National Center for Biotechnology Information) for download and online use, along with many sequence databases. Main idea: construct a dictionary of all the words in the query; initiate a local alignment for each word match between query and DB. Running time: O(MN); however, orders of magnitude faster than Smith-Waterman.
186
BLAST Original Version
Dictionary: all words of length k (~11 for DNA, 3 for proteins). Alignment initiated between words of alignment score ≥ T (typically T = k). Alignment: ungapped extensions until the score falls below a statistical threshold. Output: all local alignments with score > statistical threshold. (Figure: query words scanned against the DB.)
187
BLAST Original Version
Example: k = 4, T = 4. The matching word GGTC initiates an alignment. Extension to the left and right with no gaps until alignment falls < 50%. (Figure: dot matrix with axes A C G A A G T A A G G T C C A G T and C C C T T C C T G G A T T G C G A.)
Output:
GTAAGGTCC
GTTAGGTCC
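The word-dictionary and ungapped-extension idea can be sketched as below. This is a toy illustration, not BLAST's actual implementation: it seeds only on exact k-mer matches, scores +1/-1, and stops extending once the running score drops `max_drop` below the best seen, which is an assumed stopping rule standing in for the slide's "falls < 50%" criterion. The database string is the figure's second axis read in reverse, which is also an assumption.

```python
from collections import defaultdict

def build_word_index(query, k):
    """Dictionary: every length-k word in the query -> list of start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

def extend(query, db, qi, di, k, match=1, mismatch=-1, max_drop=2):
    """Ungapped extension of an exact seed at query[qi:qi+k] == db[di:di+k]."""
    score = best = k * match
    q_end, d_end = qi + k, di + k            # exclusive ends of the best segment
    i, j = q_end, d_end
    while i < len(query) and j < len(db):    # extend to the right
        score += match if query[i] == db[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, q_end, d_end = score, i, j
        elif best - score >= max_drop:
            break
    score = best                             # restart from the best right-hand end
    q_start, d_start = qi, di
    i, j = qi - 1, di - 1
    while i >= 0 and j >= 0:                 # extend to the left
        score += match if query[i] == db[j] else mismatch
        if score > best:
            best, q_start, d_start = score, i, j
        elif best - score >= max_drop:
            break
        i -= 1
        j -= 1
    return best, query[q_start:q_end], db[d_start:d_end]

def seed_and_extend(query, db, k=4):
    """Scan the DB for words in the query dictionary and extend each hit."""
    index = build_word_index(query, k)
    hits = set()
    for di in range(len(db) - k + 1):
        for qi in index.get(db[di:di + k], ()):
            hits.add(extend(query, db, qi, di, k))
    return sorted(hits, reverse=True)

hits = seed_and_extend("ACGAAGTAAGGTCCAGT", "AGCGTTAGGTCCTTCCC")
```

On these inputs the seeds AGGT, GGTC and GTCC all lie on the same diagonal and extend to the same segment, GTAAGGTCC against GTTAGGTCC with score 7, so `hits` contains a single entry.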
188
Gapped BLAST
Added features: pairs of words can initiate alignment; extensions with gaps in a band around the anchor. (Figure: dot matrix with axes A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A.)
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
189
Example Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
1,726,556 sequences; 8,074,398,388 total letters

>gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
Length =
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: tacaccccgattacaccccga 24
       ||||||| |||||||||||||
Sbjct: tacacccagattacaccccga

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: tacaccccgattacaccccga 24
Sbjct: tacacccagattacaccccga

>gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
Length =
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Sbjct: tacacccagattacaccccga 3911
190
Example Query: Human atoh enhancer, 179 letters [1.5 min]
Result: 57 blast hits
gi| |gb|AF |AF Homo sapiens ATOH1 enhanc e-95
gi| |gb|AC | Mus musculus Strain C57BL6/J ch e-68
gi| |gb|AF |AF Mus musculus Atoh1 enhanc e-66
gi| |gb|AF | Gallus gallus CATH1 (CATH1) gene e-12
gi| |emb|AL | Zebrafish DNA sequence from clo e-05
gi| |gb|AC | Oryza sativa chromosome 10 BAC O
gi| |ref|NM_ | Mus musculus suppressor of Ty
gi| |gb|BC | Mus musculus, Similar to suppres

gi| |gb|AF |AF Mus musculus Atoh1 enhancer sequence
Length = 1517
Score = 256 bits (129), Expect = 9e-66
Identities = 167/177 (94%), Gaps = 2/177 (1%)
Strand = Plus / Plus

Query: 3    tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
            ||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203

Query: 63   cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
            ||||||||||||||||||||||||||  ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262

Query: 123  cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179
            ||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318
191
BLAST Score: bit score vs raw score
Bit score is converted from raw score by taking into account K and $\lambda$: $S' = (\lambda S - \log K) / \log 2$. To compute E-value from bit score: $E = K M' N' e^{-\lambda S} = M' N' 2^{-S'}$. Critical score is now $S^* = \log_2(M'N')$. If $S' \gg S^*$: significant. If $S' \ll S^*$: not significant. ($M' \approx M$, $N' \approx N$.)
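The conversion between raw score, bit score, and E-value can be checked numerically. A small sketch reusing the earlier ungapped example (lambda = ln 3, K = 0.1, M = N = 1000); these parameter values and the function names are assumptions for illustration, not the values NCBI BLAST uses for any particular matrix.

```python
import math

def bit_score(raw, lam, k):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)

def e_value_from_bits(bits, m, n):
    """E = M * N * 2^(-S')."""
    return m * n * 2.0 ** (-bits)

lam, k = math.log(3), 0.1
bits = bit_score(20, lam, k)                   # raw score 20 -> about 35 bits
e_bits = e_value_from_bits(bits, 1000, 1000)   # E-value computed from the bit score
e_raw = k * 1000 * 1000 * math.exp(-lam * 20)  # E-value computed from the raw score
critical = math.log2(1000 * 1000)              # S* = log2(M * N), about 20 bits
```

Both routes give the same E-value (about 3e-5), and since `bits` is well above `critical`, the alignment is significant, consistent with the earlier example.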
192
Different types of BLAST
blastn: search nucleic acid databases blastp: search protein databases blastx: you give a nucleic acid sequence, search protein databases tblastn: you give a protein sequence, search nucleic acid databases tblastx: you give a nucleic sequence, search nucleic acid database, implicitly translate both into protein sequences
193
BLAST cons and pros
Advantages: Fast!!!! A few minutes to search a database of $10^{11}$ bases. Disadvantages: sensitivity may be low; often misses weak homologies. New improvements: Make it even faster (mainly for aligning very similar sequences or really long sequences, e.g. whole genome vs whole genome). Make it more sensitive (PSI-BLAST: iteratively add more homologous sequences; PatternHunter: discontinuous seeds).
194
Variants of BLAST NCBI-BLAST: most widely used version
WU-BLAST (Washington University BLAST): another popular version; optimized, added features. MEGABLAST: optimized to align very similar sequences; linear gap penalty. BLAT: Blast-Like Alignment Tool. BlastZ: optimized for aligning two genomes. PSI-BLAST: BLAST produces many hits; those are aligned, and a pattern is extracted; the pattern is used for the next search; the above steps are iterated. Sensitive for weak homologies, but slower.
195
Summary Part I: Algorithms Part II: Biological issues
Global sequence alignment: Needleman-Wunsch Local sequence alignment: Smith-Waterman Improvement on space and time Part II: Biological issues Model gaps more accurately: affine gap penalty Alignment statistics Part III: Heuristic algorithms – BLAST family