1
Basic Local Alignment Search Tool
Presented by Mei Liu August 7, 2008
2
Introduction
BLAST finds regions of local similarity between sequences.
It assesses which DNA or protein sequences in a large database have significant similarity to a given query sequence.
The results are used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.
There are two implementations of BLAST: one by NCBI and the other at Washington University (WU-BLAST).
5
Introduction
WU-BLAST printouts give the following values:
Score (or High Score), bit scores, Expect values, and P-values.
6
Outline
Comparison of two aligned sequences
BLAST random walk
Parameter calculations
Choice of score
Bounds and approximation for the BLAST P-value
Normalized and bit scores
Number of high-scoring excursions
Karlin-Altschul sum statistic
7
Outline
Comparison of two unaligned sequences
Comparison of a query sequence against a database
Minimum significance lengths
Parametric or non-parametric test?
Gapped BLAST and PSI-BLAST
8
1. Two Aligned Sequences
Given: an ungapped global alignment of two protein sequences, both of length N.
Null hypothesis: for each aligned pair of amino acids, the two amino acids are generated by independent mechanisms.
Null hypothesis probability of the amino acid pair (j, k) = $p_j p'_k$ (10.1)
Alternative hypothesis probability of the amino acid pair (j, k) = $q(j,k)$ (10.2)
9
1.1 BLAST Random Walk
Number the positions from left to right as 1, 2, …, N.
A score S(j, k) is allocated to each aligned amino acid pair (j, k).
In applications of BLAST, the score is taken from a BLOSUM or PAM substitution matrix.
10
1.1 BLAST Random Walk
PAM
Developed by Margaret Dayhoff in the 1970s.
Calculated by observing the differences in closely related proteins.
The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed.
Derived matrices go as high as PAM250.
Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance.
PAM matrices do not work very well for aligning evolutionarily divergent sequences.
11
1.1 BLAST Random Walk
BLOSUM
Henikoff and Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins.
The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments.
To reduce bias from closely related sequences, segments in a block with a sequence identity above a certain threshold were clustered.
For BLOSUM62, this threshold was set at 62%.
Larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity.
12
1.1 BLAST Random Walk
13
1.1 BLAST Random Walk
The accumulated score at position i is the sum of the scores for the amino acid comparisons at positions 1, 2, …, i.
Sequence 1: T Q L A A W C R M T C F E I E C K V
Sequence 2: R H L D S W R R A S D D A R I E E G
S(j, k): -1, 1, 5, -2, 1, 15, -4, 7, -1, 2, -4, etc.
Accumulated score: -1, 0, 5, 3, 4, 19, 15, 22, etc.
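The accumulated score is just a running sum, which can be checked against the numbers on the slide. The snippet below is a minimal sketch; the variable names are illustrative and only the eleven scores listed on the slide are used.

```python
from itertools import accumulate

# Per-position scores S(j, k) copied from the slide (first 11 positions).
scores = [-1, 1, 5, -2, 1, 15, -4, 7, -1, 2, -4]

# The accumulated score at position i is the running sum of the first i scores.
walk = list(accumulate(scores))
print(walk)
```

The first eight values of `walk` reproduce the accumulated scores listed on the slide.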
14
1.1 BLAST Random Walk
A ladder point is a point in the walk that is lower than any point previously reached.
Let Y1, Y2, … be the respective maximum heights of the walk, relative to the height of each ladder point, reached after leaving that ladder point and before arriving at the next.
Define Ymax as the maximum of these maxima.
Ymax is the test statistic used in BLAST, so it is necessary to find its null hypothesis distribution.
The random variables Yi have a geometric-like distribution, $\mathrm{Prob}(Y \ge y) \sim C e^{-\lambda y}$.
C and λ depend on the substitution matrix used and on the amino acid frequencies $\{p_j\}$ and $\{p'_k\}$.
The probability distribution of Ymax, apart from C and λ, also depends on the mean number of ladder points in the walk.
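To make the definitions concrete, the following sketch finds ladder points and excursion heights for a list of scores. It assumes the usual convention that the walk starts at height 0 and that a ladder point is a position where the walk falls below every earlier value; the function name `excursion_heights` is hypothetical.

```python
def excursion_heights(scores):
    """Return (ladder_positions, heights) for a sequence of step scores.

    Assumption: the walk starts at height 0 (position 0 counts as the first
    ladder point), and a new ladder point occurs whenever the walk drops
    strictly below the level of the most recent ladder point.
    """
    ladder_positions = [0]
    heights = [0]      # heights[i] = max rise above ladder point i
    level = 0          # current height of the walk
    ladder_level = 0   # height of the most recent ladder point
    for i, s in enumerate(scores, start=1):
        level += s
        if level < ladder_level:
            # New ladder point: start a new excursion from here.
            ladder_level = level
            ladder_positions.append(i)
            heights.append(0)
        else:
            heights[-1] = max(heights[-1], level - ladder_level)
    return ladder_positions, heights


scores = [-1, 1, 5, -2, 1, 15, -4, 7, -1, 2, -4]
positions, heights = excursion_heights(scores)
y_max = max(heights)   # the BLAST test statistic for this (truncated) walk
print(positions, heights, y_max)
```

For the eleven scores from the earlier example there is a single ladder point after the start (position 1, height -1), and the walk later climbs to 23, so the excursion height is 24.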
15
1.2 Parameter Calculations
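The step size is identified with a score S(j, k).
The null hypothesis probability of taking a step of any given size is found from the two sets of frequencies $\{p_j\}$ and $\{p'_k\}$.
When the null hypothesis is true, λ can be calculated as the positive solution of $\sum_{j,k} p_j p'_k e^{\lambda S(j,k)} = 1$ (7.61) (10.3)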
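A minimal numerical sketch of this calculation follows. It assumes λ is the unique positive root of Σ P(step = s) e^{λs} = 1, where the step distribution has negative mean and at least one positive step; the function name `solve_lambda` and the bisection approach are illustrative choices, not taken from BLAST itself.

```python
import math


def solve_lambda(step_probs, tol=1e-12):
    """Solve sum_s P(s) * exp(lam * s) = 1 for the positive root lam.

    step_probs maps each possible step size (score) to its null probability,
    e.g. obtained by summing p_j * p'_k over all pairs (j, k) with that score.
    """
    mean = sum(s * p for s, p in step_probs.items())
    if mean >= 0 or max(step_probs) <= 0:
        raise ValueError("need a negative mean step and a positive step")

    def f(lam):
        return sum(p * math.exp(lam * s) for s, p in step_probs.items()) - 1.0

    # f(0) = 0, f is negative just above 0 and eventually positive,
    # so double the upper end until the sign changes, then bisect.
    lo, hi = 0.0, 1.0
    while f(hi) <= 0:
        lo, hi = hi, 2 * hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# Simple check: steps +1 and -1 with probabilities 0.3 and 0.7
# have the closed-form solution lam = log(0.7 / 0.3).
lam = solve_lambda({+1: 0.3, -1: 0.7})
print(lam, math.log(0.7 / 0.3))
```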
16
1.2 Parameter Calculations
Ymax depends on C, λ, and the mean number of ladder points in the BLAST walk.
The mean number of ladder points in turn depends on the mean distance A between ladder points.
The calculation of A depends on the calculation of the probabilities $R_{-j}$ that a ladder point lies j below the previous one. (7.41)
There are two alternative approaches to this calculation.
17
1.2 Parameter Calculations
Decomposition of paths
Example: a walk with two possible steps, +1 and -2, with respective probabilities p and q = 1 - p.
Any ladder point reached in the walk is at a distance 1 or 2 below the previous one.
The respective probabilities of the two cases are $R_{-1}$ and $R_{-2} = 1 - R_{-1}$.
The probability that -2 is a ladder point is the sum of two terms: the probability that the walk goes to -2 immediately, and the probability that it first goes to +1, then reaches 0, and then reaches -2.
18
1.2 Parameter Calculations
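$R_{-2} = q + p R_{-1} R_{-2}$ (10.4)
The first term is the probability of going directly to -2; the second is the probability of the path +1 → 0 → -2.
Substituting $R_{-1} = 1 - R_{-2}$ and solving the resulting quadratic gives $R_{-2} = \frac{-q + \sqrt{q^2 + 4pq}}{2p}$ (10.5)
The value of A then follows from Eq. (7.41); as a mean distance it equals $R_{-1} + 2R_{-2}$ in this example.
Since the two sequences compared are each of length N, and the mean distance between ladder points is A, the mean number of ladder points is N/A.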
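The +1/-2 example can be evaluated numerically. The sketch below assumes the decomposition described on the slide, R₋₂ = q + p·R₋₁·R₋₂ with R₋₁ = 1 - R₋₂, and takes the mean distance between ladder points to be A = R₋₁ + 2R₋₂; the function name is hypothetical.

```python
import math


def ladder_probabilities(p):
    """Ladder-point probabilities for a walk with steps +1 (prob p), -2 (prob 1-p).

    Returns (R_minus1, R_minus2, A), where A is the mean distance between
    successive ladder points.  Requires a negative mean step, i.e. p < 2/3.
    """
    q = 1.0 - p
    if not 0 < p < 2 / 3:
        raise ValueError("need 0 < p < 2/3 so that the mean step p - 2q is negative")
    # R_-2 = q + p * (1 - R_-2) * R_-2  rearranges to  p*R^2 + q*R - q = 0;
    # take the root lying in (0, 1).
    r2 = (-q + math.sqrt(q * q + 4 * p * q)) / (2 * p)
    r1 = 1.0 - r2
    mean_distance = 1 * r1 + 2 * r2
    return r1, r2, mean_distance


r1, r2, A = ladder_probabilities(0.5)
N = 1000
print(r1, r2, A, N / A)   # N / A = mean number of ladder points
```

With p = 0.5 the quadratic reduces to R² + R - 1 = 0, so R₋₂ is the reciprocal of the golden ratio.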
19
1.3 Choice of a Score
The BLAST score is a log likelihood ratio. Why?
The first argument is similar to that used in sequence analysis: if a random variable Y has a discrete probability distribution, the "score" statistic is defined as the log likelihood ratio.
If the amino acid pair (j, k) is observed at any position, and if $p_j p'_k$ and $q(j,k)$ are the null and alternative hypothesis probabilities, the score is proportional to $\log \frac{q(j,k)}{p_j p'_k}$ (10.6)
20
1.3 Choice of a Score
A second argument leads to the choice of a specific proportionality constant.
Suppose some arbitrary substitution matrix is chosen with (j, k) element S(j, k), and let q(j, k) be defined implicitly by $S(j,k) = \lambda^{-1} \log \frac{q(j,k)}{p_j p'_k}$ (10.7), where λ is defined in equation (10.3).
Thus q(j, k) can be defined explicitly by $q(j,k) = p_j p'_k e^{\lambda S(j,k)}$ (10.8)
21
1.3 Choice of a Score
Karlin and Altschul (1990) and Karlin (1994) showed that, when the null hypothesis is true, the frequency with which the observation (j, k) arises in high-scoring excursions is asymptotically equal to q(j, k).
They then argued that a scoring scheme is "optimal" if the frequency of the observation (j, k) in high-scoring excursions is asymptotically equal to the "target" frequency q(j, k), the frequency arising if the alternative hypothesis is true, i.e. the frequency in the most biologically relevant alignments of conserved regions.
22
1.3 Choice of a Score
The arguments for using S(j, k) in (10.7) as the score statistic lead to the following procedure.
There are various possibilities for q(j, k).
One frequently adopted choice is derived from the evolutionary arguments that lead to the PAMn matrix construction in Section 6.5.3. (10.9) (10.10)
23
1.3 Choice of a Score
The choice of S(j, k) can also be related to relative entropy.
The score defined in (10.7) is proportional to the support given by the observation (j, k) in favor of the alternative hypothesis over the null hypothesis.
When the alternative hypothesis is true, the mean support for the alternative over the null hypothesis is the relative entropy $H = \sum_{j,k} q(j,k) \log \frac{q(j,k)}{p_j p'_k} = \lambda \sum_{j,k} q(j,k) S(j,k)$ (10.11) (10.12)
24
1.3 Choice of a Score
By (10.7) and (10.12), the mean score in high-scoring segments is asymptotically $\sum_{j,k} q(j,k) S(j,k) = \lambda^{-1} H$ (10.13)
25
1.3 Choice of a Score
Simulations show that the convergence to this asymptotic value is very slow.
Direct computation of H from (10.12) is not possible when λ and S(j, k) are known but q(j, k) is unknown.
BLAST therefore uses an indirect approach to calculate H, in which q(j, k) is first calculated from (10.8).
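The indirect calculation can be sketched as follows, assuming the relations q(j,k) = p_j p'_k e^{λS(j,k)} and H = λ Σ q(j,k) S(j,k) (in nats). The two-letter alphabet, the +1/-2 scoring, and the function names are toy assumptions chosen so that λ has a closed form; they are not BLAST defaults.

```python
import math


def target_frequencies(p, p_prime, S, lam):
    """q(j, k) = p_j * p'_k * exp(lam * S[j][k]) for every pair (j, k)."""
    return [[p[j] * p_prime[k] * math.exp(lam * S[j][k])
             for k in range(len(p_prime))]
            for j in range(len(p))]


def relative_entropy(q, S, lam):
    """H = lam * sum_{j,k} q(j, k) * S(j, k), in nats."""
    return lam * sum(q[j][k] * S[j][k]
                     for j in range(len(q)) for k in range(len(q[j])))


# Toy example (assumption): two equally frequent letters, match +1, mismatch -2.
# Here 0.5*e^lam + 0.5*e^(-2*lam) = 1 has the root lam = log(golden ratio).
p = [0.5, 0.5]
S = [[1, -2], [-2, 1]]
lam = math.log((1 + math.sqrt(5)) / 2)

q = target_frequencies(p, p, S, lam)
H = relative_entropy(q, S, lam)
total = sum(sum(row) for row in q)
print(total, H, H / lam)   # H / lam = mean score per position under q
```

The q(j,k) sum to 1 precisely because λ solves the defining equation, which is a useful sanity check on a computed λ.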
26
1.4 Bounds and Approximation for BLAST P-value
The test statistic used in BLAST is the maximum Ymax of n ≈ N/A random variables, each being a random upward excursion height following a ladder point in the BLAST random walk.
In Section 7.6.4, it was shown that each upward excursion height has the geometric-like distribution.
The aim is to obtain asymptotic bounds for the null hypothesis distribution of Ymax, and hence asymptotic bounds for a BLAST P-value.
27
1.4 Bounds and Approximation for BLAST P-value
There exists an asymptotic distribution for the maximum of n iid continuous random variables whose density function has support of the form (A, +∞).
However, Ymax is a discrete random variable.
The continuous distribution results are therefore used to find asymptotic bounds for the distribution of Ymax.
If Xmax is the maximum of n iid continuous random variables and Ymax = floor(Xmax), then Ymax is a discrete random variable.
Thus, for any positive integer y, the bounds (10.14) hold.
28
1.4 Bounds and Approximation for BLAST P-value
Let Xmax be the maximum of n iid random variables, each having an exponential distribution, and let Ymax = floor(Xmax).
Then Ymax has the same distribution as the maximum of n iid random variables, each having a geometric distribution.
Applying Eq. (2.130) and the bounds in (10.14) gives a close approximation. (10.15) (10.16) (10.17)
29
1.4 Bounds and Approximation for BLAST P-value
If we replace n by N/A, the mean number of BLAST ladder points, and define a new parameter K, the inequality (10.17) takes a new form.
Replacing y by $x + \lambda^{-1} \log N$ then gives the bounds used below. (10.18) (10.19) (10.20) (10.21) (10.22)
30
1.4 Bounds and Approximation for BLAST P-value
These bounds for the BLAST P-value are not directly relevant in practice, because a BLAST search involves the comparison of a short query sequence with a large database containing many fragments, with no a priori alignment.
Nevertheless, the P-value approximation derives ultimately from the lower P-value bound in Eq. (10.22).
It would be more appropriate to use the conservative upper bound in (10.22), which overestimates the true P-value, rather than the lower bound.
31
1.5 Normalized and Bit Scores
Karlin and Altschul (1993) call the expression $S' = \lambda Y_{\max} - \log(NK)$ a "normalized score". (10.25)
In terms of this score, the inequalities (10.20) can be rewritten. (10.26) (10.27)
From the upper inequality, the P-value corresponding to an observed value s' is approximately $1 - \exp(-e^{-s'})$ (10.28)
32
1.5 Normalized and Bit Scores
BLAST records a score similar to the normalized score S', namely the "bit" score, defined by $(\lambda Y_{\max} - \log K)/\log 2$.
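As an illustration, the sketch below computes both scores, assuming the standard Karlin-Altschul forms S' = λ·Ymax - log(NK) for the normalized score, (λ·Ymax - log K)/log 2 for the bit score, and P ≈ 1 - exp(-e^{-s'}) for the P-value. The parameter values are made up for the example.

```python
import math


def normalized_score(y_max, lam, K, N):
    """Normalized score S' = lam * y_max - log(N * K)."""
    return lam * y_max - math.log(N * K)


def bit_score(y_max, lam, K):
    """Bit score (lam * y_max - log K) / log 2; does not depend on N."""
    return (lam * y_max - math.log(K)) / math.log(2)


def p_value(s_norm):
    """Approximate P-value 1 - exp(-exp(-s')), computed stably via expm1."""
    return -math.expm1(-math.exp(-s_norm))


# Hypothetical parameter values, for illustration only.
lam, K, N, y_max = 0.3, 0.1, 1000, 30
s = normalized_score(y_max, lam, K, N)
b = bit_score(y_max, lam, K)
print(s, b, p_value(s))
```

The two scores differ only in scale and in whether the search-space size N is absorbed: b·log 2 - log N equals the normalized score.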
33
1.6 Number of High-Scoring Excursions
The quantity E' corresponds to the quantity "Expect" in BLAST printouts.
Under the null hypothesis, the maximum height Y of each excursion has a geometric-like distribution, and the number of excursions is approximately N/A.
In BLAST, the mean number of excursions reaching a height v or more is approximately $N K e^{-\lambda v}$, with K as defined earlier. (10.18) (10.34)
34
1.6 Number of High-Scoring Excursions
The expected number of excursions reaching the observed maximal score ymax or more is $E' = N K e^{-\lambda y_{\max}} = e^{-s'}$, so that the P-value is approximately $1 - e^{-E'}$. (10.35) (10.36) (10.37)
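The relationship between Expect and the P-value can be sketched as below, assuming E' = N·K·e^{-λv} and P = 1 - e^{-E'} (so that E' = -log(1 - P)). Function names and parameter values are illustrative only.

```python
import math


def expect(v, lam, K, N):
    """Approximate mean number of excursions reaching height v or more."""
    return N * K * math.exp(-lam * v)


def p_from_expect(e):
    """P-value 1 - exp(-E'), computed stably for small E'."""
    return -math.expm1(-e)


def expect_from_p(p):
    """Inverse relation E' = -log(1 - P)."""
    return -math.log1p(-p)


# Hypothetical parameter values, for illustration only.
e = expect(v=40, lam=0.3, K=0.1, N=1000)
p = p_from_expect(e)
print(e, p)
```

For small values the two quantities are nearly equal, which is why Expect values and P-values agree closely for highly significant hits and diverge as E' approaches 1.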
35
1.7 Karlin-Altschul Sum Statistic
Focusing on Ymax alone loses the information provided by the heights of the 2nd, 3rd, etc. highest excursions in the random walk.
Consider instead the r largest Yi values and compute the r corresponding normalized scores S1, …, Sr. (10.38)
36
1.7 Karlin-Altschul Sum Statistic
Karlin and Altschul (1993) showed that, to a close approximation, the null hypothesis joint density function of these normalized scores is given by (10.39).
Any reasonable function of S1, …, Sr can be the test statistic.
The transformation methods introduced in Chap. 2 are used to find the distribution of this test statistic.
This in turn allows computation of the P-value and the E (Expect) value corresponding to any observed value of the statistic.
37
1.7 Karlin-Altschul Sum Statistic
The statistic suggested is the sum of the normalized scores, $T_r = S_1 + S_2 + \dots + S_r$, called the Karlin-Altschul sum statistic.
The null hypothesis density function f(t) of Tr is given by (10.40).
When t is sufficiently large, this density function can be used to find the approximate expression $\mathrm{Prob}(T_r \ge t) \approx \frac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}$ (10.41)
38
1.7 Karlin-Altschul Sum Statistic
The approximation (10.41) is sufficiently accurate when t > r(r+1), and BLAST uses it when this inequality holds.
If t is the observed value of Tr, the right-hand side of (10.41) provides the approximate P-value corresponding to this observed value.
This is used as a component of the eventual BLAST printout P-value.
Example: s1 = 4.4 and s2 = 2.5.
For r = 1, the P-value for the highest normalized score 4.4 is $e^{-4.4} \approx 0.012$.
For r = 2, the P-value for the sum 6.9 is $(6.9/2)\, e^{-6.9} \approx 0.0035$.
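The worked example can be reproduced with a few lines, assuming the tail approximation Prob(Tr ≥ t) ≈ e^{-t} t^{r-1} / (r! (r-1)!), which matches both numerical cases on the slide. The function names are illustrative.

```python
import math


def sum_p_value(normalized_scores):
    """Approximate P-value for the sum of the r given normalized scores."""
    r = len(normalized_scores)
    t = sum(normalized_scores)
    return math.exp(-t) * t ** (r - 1) / (math.factorial(r) * math.factorial(r - 1))


def approximation_applies(normalized_scores):
    """The slide's accuracy condition t > r(r + 1)."""
    r = len(normalized_scores)
    return sum(normalized_scores) > r * (r + 1)


p1 = sum_p_value([4.4])        # r = 1: e^-4.4
p2 = sum_p_value([4.4, 2.5])   # r = 2: (6.9 / 2) * e^-6.9
print(round(p1, 4), round(p2, 4))
```

In this example the pair of segments is more significant than the best segment alone, and both cases satisfy t > r(r+1) (4.4 > 2 and 6.9 > 6).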
39
2. Two Unaligned Sequences
Two sequences of lengths N1 and N2 are given, but no specific alignment is given.
The aim is to find the significance of high-scoring segment pairs among all possible (ungapped) local alignments.
40
2.1 Theoretical and Empirical Background
BLAST considers all ungapped alignments determined by all possible relative positions of the two sequences.
For each relative position, the alignment is extended as far as possible in either direction, giving a total of N1 + N2 - 1 ungapped alignments.
41
2.1 Theoretical and Empirical Background
Each alignment yields a random walk.
There are N1N2 comparisons in total between the two sequences, taking all possible positions relative to each other.
Many conclusions from the previous section carry over to the present case with N replaced by N1N2, or by a more refined function allowing for edge effects.
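The counting argument, and the search for the best ungapped segment over all relative positions, can be sketched as follows. The +1/-1 match/mismatch scoring and the function names are assumptions for illustration; BLAST itself uses a substitution matrix and a heuristic seeded search rather than this exhaustive scan.

```python
def diagonal_lengths(n1, n2):
    """Length of the ungapped alignment for each relative offset d = i - j."""
    return [min(n1, n2 + d) - max(0, d) for d in range(-(n2 - 1), n1)]


def best_ungapped_segment(seq1, seq2, score):
    """Maximum segment score over all diagonals (exhaustive scan).

    On each diagonal the running score is reset to 0 whenever it would go
    negative, so it tracks the height of the walk above its running minimum.
    """
    n1, n2 = len(seq1), len(seq2)
    best = 0
    for d in range(-(n2 - 1), n1):
        running = 0
        for i in range(max(0, d), min(n1, n2 + d)):
            running = max(0, running + score(seq1[i], seq2[i - d]))
            best = max(best, running)
    return best


def toy_score(a, b):
    # Assumed toy scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


lengths = diagonal_lengths(5, 4)
print(len(lengths), sum(lengths))                      # N1 + N2 - 1 and N1 * N2
print(best_ungapped_segment("AACGT", "CGTT", toy_score))
```

For sequences of lengths 5 and 4 there are 8 diagonals covering 20 pairwise comparisons, and the best segment is the shared "CGT".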
42
2.1 Theoretical and Empirical Background
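Ymax is the maximum score achieved in the random walks comparing the sequences, using all possible ungapped local alignments.
The mean number of ladder points is $N_1 N_2 / A$.
Assuming the null hypothesis is true, the inequalities in (10.21) are replaced by their analogues with N replaced by N1N2.
The normalized score S' is redefined as $S' = \lambda Y_{\max} - \log(N_1 N_2 K)$.
The expected number E' of excursions reaching a height ymax or more is $E' = N_1 N_2 K e^{-\lambda y_{\max}}$.
The null hypothesis mean of Ymax is also obtained with N replaced by N1N2. (10.42) (10.43) (10.44) (10.45) (10.46)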
43
2.2 Edge Effects
A high-scoring random walk excursion might be cut short at the end of a sequence match.
So the heights of high-scoring excursions, and the number of such excursions, will be less than predicted by the theory.
Edge effects are an important factor in the comparison of two comparatively short sequences.
BLAST theory concerns two long sequences; in practice, BLAST considers databases of a large number of short sequences.
44
2.2 Edge Effects
BLAST calculations allow for edge effects by subtracting from both N1 and N2 a factor depending on the mean length of any high-scoring excursion.
Eq. (10.13) showed that the mean value of the step in a high-scoring excursion asymptotically approaches $\lambda^{-1} H$.
If the height achieved by a high-scoring excursion is denoted by y, the mean length E(L|y) of this excursion, conditional on y, is $E(L \mid y) = \lambda y / H$ (10.47)
BLAST theory replaces N1 and N2 by $N_1' = N_1 - \lambda y / H$ and $N_2' = N_2 - \lambda y / H$.
45
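2.2 Edge Effects
Specifically, the normalized score is replaced by $S' = \lambda Y_{\max} - \log(N_1' N_2' K)$.
The expected number of excursions scoring v or higher is replaced by $N_1' N_2' K e^{-\lambda v}$, and E' is given by the same expression evaluated at v = ymax. (10.48) (10.49) (10.50) (10.51)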
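A minimal sketch of the edge correction follows, assuming the mean excursion length λy/H is subtracted from both sequence lengths before computing the expected count N1'·N2'·K·e^{-λv}. The clamp to a minimum effective length of 1 and all parameter values are assumptions added for the example, not part of the slides.

```python
import math


def expected_excursion_length(y, lam, H):
    """Mean length of an excursion of height y: y divided by the mean step H / lam."""
    return lam * y / H


def edge_corrected_expect(v, n1, n2, lam, K, H):
    """Expected number of excursions scoring v or more, using effective lengths."""
    ell = expected_excursion_length(v, lam, H)
    # Assumption: never let an effective length fall below 1.
    n1_eff = max(n1 - ell, 1.0)
    n2_eff = max(n2 - ell, 1.0)
    return n1_eff * n2_eff * K * math.exp(-lam * v)


# Hypothetical parameter values, for illustration only.
lam, K, H = 0.3, 0.1, 0.4
n1, n2, v = 200, 300, 50
uncorrected = n1 * n2 * K * math.exp(-lam * v)
corrected = edge_corrected_expect(v, n1, n2, lam, K, H)
print(expected_excursion_length(v, lam, H), uncorrected, corrected)
```

Because the effective lengths are shorter, the corrected expected count is smaller than the uncorrected one, which is why an overly aggressive correction can understate P-values for short sequences.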
46
2.2 Edge Effects
The use of the edge correction in (10.49) assumes that the asymptotic formula for the mean step size in a high-scoring excursion is appropriate.
Values calculated from Eq. (10.47) are inaccurate for anything other than very large values of N.
The use of the edge correction in (10.49) might therefore in practice lead to P-value estimates that are less than the correct values for anything other than very large N.
47
2.2 Edge Effects
In BLAST, the edge effect correction factor for the Karlin-Altschul sum statistic Tr is calculated as follows.
A raw edge effect correction is calculated first; the edge correction value E(L) is then defined by (10.52), in which f is an "overlap adjustment factor" that can be chosen by the user.
The default f = 0.125 implies that overlaps between segments of up to 12.5% are allowed.
48
2.3 Multiple Testing
There is no obvious choice for the value of r.
BLAST considers all r = 1, 2, 3, … and chooses the set of HSPs with the lowest sum statistic P-value as the most significant.
However, this implies a sequence of tests, one for each r, so the issue of multiple testing arises.
Ignoring the multiple testing issue can lead to a significant overestimate of the significance of BLAST P-values.
Unfortunately, no rigorous theory is available to deal with this issue; in practice, it is handled in an ad hoc manner.
49
2.3 Multiple Testing
Example: WU-BLAST
The P-value is adjusted by dividing by a factor that depends on r and on a parameter π. (10.56)
When r = 1, the factor becomes 1 - π, which implies that E' is divided by 1 - π.
The BLAST default value π = 0.5 implies that E = 2E', so that the P-value is then found as $1 - e^{-2E'}$ (10.57)
50
3. Query Sequence vs. Database
The query sequence is compared to each database sequence to obtain P-values for the individual comparisons.
For r = 1, the probability of a match with score v or more is computed first; Expect, the mean number of HSPs scoring v or more in the entire database, is then obtained from it. (10.58) (10.59) (10.60)
D = total length of the database (sum of the lengths of all database sequences)
N2 = length of the database sequence
51
3. Query Sequence vs. Database
For r > 1, a total database value of Expect is calculated from each P-value. (10.61)
Finally, all single (r = 1) HSPs or summed (r > 1) HSPs with sufficiently low values of Expect are listed. (10.60)
52
4. Minimum Significance Lengths
Correct choice of n
When sequences are distantly related, the similarities between them might be subtle, and significant similarity cannot be detected unless a long alignment is available.
On the other hand, if the sequences are very similar, a relatively short alignment is sufficient.
If the similarity is subtle, each aligned pair tells us less (in terms of information) than an aligned pair in more similar sequences.
This leads to the concept of information content per position in an alignment.
53
4. Minimum Significance Lengths
Using a PAMn matrix amounts to testing the following hypotheses.
Alternative hypothesis: n is the correct value to use in the evolutionary process leading to the two protein sequences.
Null hypothesis: the appropriate value of n is +∞.
Here, assume that the alternative hypothesis is correct (i.e. the correct value of n is chosen).
Aspects of the power of the testing procedure are explored by finding the mean length of protein sequence needed before the alternative hypothesis is accepted.
54
4. Minimum Significance Lengths
Suppose that we decide to adopt a testing procedure with Type I error α (false positive rate).
The value s of the normalized score statistic S' is then given approximately by s = -log α.
The corresponding value ymax of Ymax is $y_{\max} = \lambda^{-1} \log(NK/\alpha)$ (10.64)
When the alternative hypothesis is true, the mean score for the amino acid comparison at any position is $\sum_{j,k} q(j,k) S(j,k) = \lambda^{-1} H$ (10.65)
55
4. Minimum Significance Lengths
Chapter 7 showed that if the mean final position in a random walk is F and the mean step size is G, then the mean number of steps needed to reach the final position is F/G.
The mean sequence length needed in the maximally scoring local alignment in order to obtain significance with Type I error α is therefore $\frac{\log(NK/\alpha)}{H}$ (10.66)
56
4. Minimum Significance Lengths
Since the various components can be interpreted in terms of bits of information, the ratio (10.66) can be rewritten using base-2 logarithms. (10.67)
Denominator = mean of the relative support, in bits, provided by one observation for the alternative hypothesis against the null hypothesis, given that the alternative hypothesis is true.
Numerator = mean total number of bits of information needed to claim that the two sequences are similar.
57
4. Minimum Significance Lengths
Typically K ≈ 0.1 and α = 0.05 or 0.01.
Thus the numerator is largely determined by the length N and is approximately $\log_2 N$.
Example: for N = 1000, about 9.97 bits of information are needed to claim significant similarity between two sequences.
The main interest is in the minimum significant length.
58
4. Minimum Significance Lengths
If n is large, q(j, k) is close to $p_j p'_k$, the mean information per aligned pair given in the denominator is small, and the minimum significant length is large.
If the null and alternative hypotheses specify quite similar probabilities for any aligned pair, many observations will in general be needed to decide between the two hypotheses.
If n is small, the mean relative support for the alternative hypothesis is large, and the minimum significant length is small.
59
4. Minimum Significance Lengths
In the limit n → 0, q(j, j) = pj and q(j, k) = 0 for j ≠ k.
The denominator, the mean support from each position in favor of the alternative hypothesis, approaches $-\sum_j p_j \log_2 p_j$.
If all amino acids are equally frequent, this mean support is $\log_2 20 = 4.32$.
In practice, the actual frequencies of observed amino acids imply that a more appropriate value is about 4.17.
Thus, the minimum significant length is $(\log_2 N)/4.17$; if N = 1000, this is about 2.39.
60
4. Minimum Significance Lengths
Take N = 1000 and n = 250, corresponding to a PAM250 substitution matrix.
The probabilities q(j, k) are such that each amino acid pair provides a mean of only 0.36 bits of information.
The minimum significance length is then $\log_2(1000)/0.36 \approx 28$: an alignment of about 28 positions is required on average to accept the alternative hypothesis.
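The arithmetic in these examples is easy to reproduce. The sketch below uses the slides' approximation that the number of bits needed is about log2(N); the helper names are illustrative.

```python
import math


def min_significant_length(N, bits_per_position):
    """Approximate minimum significant length: log2(N) / (mean bits per aligned pair)."""
    return math.log2(N) / bits_per_position


def entropy_bits(freqs):
    """-sum p log2 p: the limiting (n -> 0) mean support per position, in bits."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


bits_needed = math.log2(1000)                      # about 9.97 bits
uniform_support = entropy_bits([1 / 20] * 20)      # log2(20), about 4.32 bits
short = min_significant_length(1000, 4.17)         # identical sequences
long_ = min_significant_length(1000, 0.36)         # PAM250-like divergence
print(round(bits_needed, 2), round(uniform_support, 2), round(short, 2), round(long_, 1))
```

The contrast between roughly 2.4 positions and roughly 28 positions shows how quickly the required alignment length grows as the information per aligned pair shrinks.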
61
4. Minimum Significance Lengths
Incorrect choice of n
The above calculations all assume that the correct value of n is chosen, and thus that the correct alternative hypothesis probabilities q(j, k) are used.
In practice, it is impossible to choose a unique correct value for n when using a PAM matrix.
Suppose there is a unique correct value m, leading to a PAMm matrix, but an incorrect value n is chosen and a PAMn matrix is used instead.
What does this imply?
62
4. Minimum Significance Lengths
Suppose that with the correct choice m, the probability of the ordered pair (j, k) is r(j, k).
The mean score is then $\sum_{j,k} r(j,k) S(j,k)$ (10.68)
When n = m, r(j, k) = q(j, k) and the mean score is positive.
More generally, the mean score is positive when n and m are close.
But as m → +∞, r(j, k) → $p_j p'_k$ and the mean score is negative.
Thus for any choice of n there will be values of m sufficiently large compared to n that the mean score is negative.
63
4. Minimum Significance Lengths
When the mean score is positive, the minimal significance length is given by (10.69).
The minimal length depends on q(j, k), that is, on the choice of n.
The choice of n involves substantial extrinsic guesswork, so it is important to assess the implications of an incorrect choice.
64
4. Minimum Significance Lengths
Negative means arise when m is sufficiently large compared to n, that is, when the two species being compared diverged a long time in the past relative to the time assumed by the PAM matrix used in the analysis.
The more negative this mean is, the more likely it is that the null hypothesis will be accepted.
In the limit m → +∞, when r(j, k) = $p_j p'_k$, the probability of rejecting the null hypothesis is equal to the chosen Type I error.
Example: if n = 100 is chosen, the mean score is negative when m is 193 or more.
65
4. Minimum Significance Lengths
In conclusion:
A correctly chosen small value of n leads to shorter minimal significance lengths.
An incorrectly small choice of n may mean that a real similarity between the two sequences is not picked up.
In practice, a variety of substitution matrices is sometimes used to overcome this problem.
However, this practice must be viewed with some caution, especially in the light of the multiple testing problem.
66
5. Parametric or Non-parametric
Parametric test: the test statistic is found from likelihood ratio arguments.
Non-parametric test: the test statistic is found on reasonable but nevertheless arbitrary grounds.
Many of the calculations and arguments used in the preceding sections follow from deriving the score S(j, k) in a substitution matrix from likelihood ratio arguments.
In this sense, BLAST testing theory can be thought of as a parametric procedure deriving from likelihood ratio theory.
67
5. Parametric or Non-parametric
The assumptions made in the theory are, however, subject to debate.
The time homogeneity assumption implicit in the calculations cannot be sustained: the genetic code influenced substitutions earlier in time, and various chemical properties influenced substitutions more recently.
Thus, comparisons of distantly related species can be problematic.
Further, if the data in a large database come from a collection of species whose respective evolutionary divergence times might differ widely, the concept of a uniformly correct choice of n is not meaningful.
68
5. Parametric or Non-parametric
Even if these claims are true, the statistical aspects of the BLAST procedure are still valid.
The P-value calculations remain correct whatever scores are used, so even if the scores were chosen in any more or less reasonable way, no problems arise with the correctness of the calculations.
In this sense, the BLAST testing process can be thought of as a non-parametric procedure.
69
6.1 Gapped BLAST
Gapped BLAST allows gaps in sequence alignments.
In the comparison of two sequences, the test statistic is the maximum score over all possible gapped alignments.
Its null hypothesis probability distribution is determined by the substitution matrix used and the gap penalty chosen.
The distribution can be estimated through simulation:
Randomly generate two sequences of lengths N1 and N2.
From these sequences, find the observed maximum score, denoted by y1.
Repeat the procedure n times, yielding n observed highest scores y1, y2, …, yn.
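The simulation procedure can be sketched as follows. This is a toy version under loudly stated assumptions: a four-letter alphabet with uniform frequencies, +1/-1 match/mismatch scores, and a linear gap penalty of 2, with a plain Smith-Waterman recursion standing in for the gapped alignment score. None of these values come from the slides or from BLAST defaults.

```python
import random


def local_alignment_score(a, b, match=1, mismatch=-1, gap=2):
    """Best gapped local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Either restart (0), extend along the diagonal, or open a gap.
            cur.append(max(0, diag, prev[j] - gap, cur[j - 1] - gap))
            best = max(best, cur[j])
        prev = cur
    return best


def simulate_max_scores(n1, n2, n_reps, alphabet="ACGT", seed=0):
    """Generate n_reps random sequence pairs and record each maximum score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_reps):
        a = rng.choices(alphabet, k=n1)
        b = rng.choices(alphabet, k=n2)
        scores.append(local_alignment_score(a, b))
    return scores


ys = simulate_max_scores(n1=50, n2=50, n_reps=20)
print(sorted(ys))
```

The resulting sample y1, …, yn is what one would then use to estimate the parameters of the null distribution of the maximum gapped score.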
70
6.1 Gapped BLAST
The approximation is made that the distribution of Ymax in the gapped case is of the same form as in the ungapped case, with revised values of K and λ.
The approach described above depends on simulation results.
If a penalty of δ is assigned to each gap in the alignment of two sequences, then (10.45) is replaced by (10.72) (10.73)
71
6.2 PSI-BLAST
PSI (Position-Specific Iterated) BLAST
In regular BLAST, a fixed substitution matrix is used to score positions in alignments; it relies on one matrix to provide the most meaningful scores for all positions in the query sequence simultaneously.
PSI-BLAST uses a standard substitution matrix in the first step.
The sequences found are then used to derive a separate scoring scheme for each position in the query sequence, which is used for the second BLAST search.
The procedure is iterated until no further iteration seems useful.
72
6.2 PSI-BLAST
The query sequence is first compared to the database sequences.
All database sequence segments having a sufficiently close similarity with the query (e.g. Expect < 0.01) are reported.
From this collection, at each site a frequency fi of amino acid i is calculated and used to estimate the frequency Qi of amino acid i at this site.
In PSI-BLAST, Σigi = 1 no longer holds.
Schäffer et al. (2001) described a new implementation, where pi is the background frequency of amino acid i and p(i, j) is the frequency with which amino acids i and j are aligned through evolutionary descent. (10.75) (10.76)
73
Any questions?