BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier
What is BLAST? BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database. Smith-Waterman is rigorous and it is guaranteed to find an optimal alignment. But also time and space consuming. It is especially inefficient in database searches. BLAST provides a rapid alternative.o S-W
Why do we use BLAST? To understanding the relatedness of any protein or DNA sequence (query sequence) to other known sequences (database) Identify sequences with a common ancestor (orthologs) and paralogs Discover new genes or proteins Explore protein structure and function …
The BLAST Algorithm S. F. Altschul, et al., 1997, Nucleic Acids Research, 25:3389 “The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)
How the original BLAST algorithm works: Step 1. size w words in the query sequence Look at the query sequence by a moving window of size w Example: for a human RBP query …FSGTWYA… (query word is in yellow) The moving window of words: FSG SGT GTW TWY WYA page 101
Step 1: compile a list of words scoring at least T with query word GTW 6,5,11 22 ASW 6,1,11 18 word hitsATW 0,5,1116 > threshold NTW 0,5,1116 GTY 6,5,213 GNW10 GAW9 word hits < threshold (T=11)
3. Extend: when you manage to find a “hit” extend the hit in either direction. Keep track of the score (use a scoring matrix) Stop when the score drops below some cutoff. KENFDKARFSGTWYAMAKKDPEG 50 RBP (query) MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit) Hit! extend 2. Scan the database for entries that contains any word from the compiled hit list.
Alignment Score It is important to assess the statistical significance of search results. For local alignments (including BLAST search results), the scores follow an extreme value distribution (EVD). E-value is closely related to the analysis of the distribution of alignment score Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87: (PubMed)(PubMed)
Alignment score as a random walk
Max Score in an Excursion P i : frequency of residue i in 1 st seq. P k ’ : frequency of residue k in 2 nd seq.
Protein Scoring
Dist’n of any excursion Y an exponential distribution
The Max of n variables Y 1, Y 2, …, Y n are identical & independently distributed Y max is the max of the above all. Then Prob( Y max<y)=(Prob(Y<y)) n
The Max of n Exponential variables Y 1, Y 2, …, Y n are independent exponential variables Y max is the max of the above all. Then Prob( Y max<y)=(Prob(Y<y)) n =(1-e -λy ) n Prob( Y max>y)=1-(Prob(Y<y)) n =1-(1-e -λy ) n
In a database of n seq.s Number of sequences: n Y 1, Y 2,…, Y n are i.i.d. exponential What happens when n is large? Using a widely used rule: (1+x/n) n exp(x) 1-exp(-nCe λy ) Probability of scores/excursion higher than y The distribution of Y max follows an extreme value dist’n
x probability normal distribution the sum of a large number of independent identically distributed (i.i.d) random variables tends to a normal distribution,
x probability extreme value distribution normal distribution the maximum of a large number of i.i.d. random variables tends to an extreme value distribution
Expected number of better scores/higher excursions 1-exp(-nCe λy ) = E-value p-value
E is the number of hits you would expect from your search with scores greater than S where: K is a constant m is the size of the query n is the size of the database being searched scales for the specific scoring matrix used (decay constant from the extreme value distribution) E = Kmn e - S
Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values. Ep=1-exp(-E) (about 0.1) (about 0.05) (about 0.001) How to interpret BLAST: E values and p values
End
Extreme value distribution The distribution: The area to the right of S: Scaling to a particular type of score: where μ is the mode and λ is a scale factor.
Extreme value distribution The distribution: The area to the right of S: Scaling to a particular type of score: where μ is the mode and λ is a scale factor. Compute this value for x=0.
Extreme value distribution The distribution: The area to the right of S: Scaling to a particular type of score: where μ is the mode and λ is a scale factor. Compute this value for x = 0. Solution: exp[-1] = 0.368
An example You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = What is the p-value associated with 45?
An example You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = What is the p-value associated with 45?
Another example You run BLAST and get a score of 23. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 20 and λ = What is the p-value associated with 23?
Another example You run BLAST and get a score of 23. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 20 and λ = What is the p-value associated with 23?
BLAST: optional parameters You can... choose the organism to search turn filtering on/off change the substitution matrix change the expect (e) value change the word size change the output format
Choosing Gap Penalty 1. Choice must be made corresponding to each type of scoring system to place gaps where they will increase the overall alignment score. 2. There are no hard and fast rules for choosing gap penalties. 3. Both g (opening penalty) and r (extension penalty) should be non-zero. 4. The value of g + r should be greater than the maximum score used for a match if insertions and deletions are considered to be rarer than nucleotide substitutions. 5. The value of g strongly influences the number of gaps introduced into a region separating two closely matching regions.
Comparing Scoring Matrix PAM Homologous seq.s during evolution Based on extrapolation of a small evol. Period Track evolutionary origins BLOSUM Conserved blocks Based on a range of evol. Periods Find conserved domains
Another way to compare perform a search of a sequence database with a known member of a protein family and to find how many members of the family are found. When gap penalty was not considered, the BLOSUM62 matrix outperformed the PAM250 matrix in finding more members of 504 different families on the Prosite database.
Extension: In the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly speeding the time required for a search. BLAST Phase 3
How a BLAST search works: threshold You can modify the threshold parameter. The default value for blastp is 11. To change it, enter “-f 16” or “-f 5” in the advanced options.
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is: E = Kmn e - S How to interpret a BLAST search: expect value
This equation is derived from a description of the extreme value distribution S = the score E = the expect value = the expected number of HSPs with a score >= S m, n = the length of two sequences, K = Karlin Altschul statistics E = Kmn e - S
Some properties of the equation E = Kmn e - S The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values. The E value for aligning a pair of random sequences must be negative! Otherwise, long random alignments would acquire great scores Parameter K describes the search space (database). For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly
From raw scores to bit scores There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores) Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes S’ = bit score = ( S - lnK) / ln2 The E value corresponding to a given bit score is: E = mn 2 -S’ Bit scores allow you to compare results between different database searches, even using different scoring matrices.
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment. p = 1 - e - How to interpret BLAST: E values and p values