BLAST: A Case Study Lecture 25
BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters. BLAST was developed to find sequences of nucleotides or amino acids in a database that match a query sequence. For example, searching the human genome for AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT produces a list of sequences scored by similarity. This system helps scientists find genetic homologues across individuals and species.
Using BLAST There are several interfaces to BLAST, and it often appears as one component of a larger suite of informatics tools. The National Center for Biotechnology Information (NCBI) hosts the primary website and a server farm dedicated to BLAST. From there, a user enters a query, selects a database, chooses a variant of BLAST to use, and sets program parameters. Results appear in seconds.
BLAST Results The NCBI BLAST tool returns results in several modes, with information centered around similarity scores. In addition to a list of matches, the tool returns a graphical view of the list that visualizes the alignments, a detailed textual view of each match, and a mapping of the matches to a visual representation of an entire genome.
How BLAST Works (Stage 1) The core BLAST algorithm has three distinct stages. In the first stage, the system splits the query sequence into overlapping words of a constant size, W. Assuming W is 4, the nucleotide query AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT produces the words AGCT, GCTT, CTTT, …, GCCT, CCTT, CTTT. BLAST compares each of these against every possible four-letter word over the sequence alphabet to build similarity scores. The words whose similarity scores exceed a threshold move on to the later stages; the rest are discarded.
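The word-splitting step can be sketched in a few lines of Python. This is only an illustration of the sliding window described above, not BLAST's actual code; the function name split_into_words is hypothetical.

```python
def split_into_words(query: str, w: int = 4) -> list[str]:
    """Return every overlapping w-length word of the query, left to right."""
    if w <= 0:
        raise ValueError("word size must be positive")
    # A query of length n yields n - w + 1 words (none if the query is shorter than w).
    return [query[i:i + w] for i in range(len(query) - w + 1)]


query = "AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT"
words = split_into_words(query, w=4)
print(words[:3])   # ['AGCT', 'GCTT', 'CTTT']
print(words[-3:])  # ['GCCT', 'CCTT', 'CTTT']
```

Note that the same word can appear more than once (CTTT occurs at both ends of this query), so a real implementation would also record where in the query each word came from.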
Side Note: Similarity in BLAST To score the similarity of two words, BLAST builds a table based on edit distances. For example, comparing AGCT to ACCC could give a score of 1, whereas comparing it to GGCT would give 3. However, some substitutions (due to mutation) are more likely than others, especially in the case of amino acids. For protein strings, BLAST accepts a scoring matrix (e.g., PAM70, a Point Accepted Mutation matrix). For nucleotide strings, users can specify distinct scores for matches and mismatches. BLAST also includes procedures for identifying and penalizing gaps.
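As a rough sketch of nucleotide word scoring, the Python below adds a match or mismatch score position by position and then filters every possible word against a threshold. The function names, the default scores (match = 1, mismatch = 0), and the choice to keep words scoring at or above the threshold are all assumptions made for illustration; they are not BLAST's defaults, and gaps and protein matrices are left out.

```python
from itertools import product


def word_score(a: str, b: str, match: int = 1, mismatch: int = 0) -> int:
    """Sum per-position scores for two equal-length words (no gaps)."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(word: str, threshold: int, alphabet: str = "ACGT",
                 match: int = 1, mismatch: int = 0) -> list[str]:
    """Return every word over the alphabet scoring at least `threshold` against `word`."""
    candidates = ("".join(p) for p in product(alphabet, repeat=len(word)))
    return [c for c in candidates
            if word_score(word, c, match, mismatch) >= threshold]


print(word_score("AGCT", "GGCT"))              # 3: three of four positions match
print(len(neighborhood("AGCT", threshold=3)))  # 13: the word itself plus 12 one-letter variants
```

With four letters and W = 4 there are only 256 candidate words, so brute-force enumeration is cheap here; for the 20-letter amino acid alphabet the candidate set is far larger, which is one reason the threshold matters.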
How BLAST Works (Stages 2 and 3) At this point, BLAST has built a set of W-length words whose scores exceed a user-provided threshold. During the second stage, the system searches for all occurrences of these words within the database. In the third stage, BLAST extends each of these W-length matches in both directions to get the final similarity score. The system also calculates an E-value for each score: the number of matches with a score at least that high that would be expected by chance in a database of that size, so smaller E-values indicate more significant matches.
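Stages 2 and 3 can be sketched as follows. This is a simplified illustration under several assumptions that are not in the slides: only the exact query words are used as seeds (no neighborhood words), the extension is ungapped, and it stops once the running score falls more than x_drop below the best score seen. The function names and the match, mismatch, and x_drop values are hypothetical. The E-value function uses the standard Karlin-Altschul form E = K·m·n·e^(−λS); K and λ depend on the scoring scheme and must be supplied by the caller.

```python
import math
from collections import defaultdict


def find_hits(query: str, database: str, w: int = 4) -> list[tuple[int, int]]:
    """Stage 2: return (query_pos, db_pos) for every exact w-length word match."""
    index = defaultdict(list)
    for d in range(len(database) - w + 1):
        index[database[d:d + w]].append(d)
    return [(q, d)
            for q in range(len(query) - w + 1)
            for d in index.get(query[q:q + w], [])]


def extend_hit(query, database, q, d, w=4, match=1, mismatch=-1, x_drop=2):
    """Stage 3: ungapped extension of one hit; returns (score, q_start, d_start, length)."""
    best = cur = w * match
    best_right = best_left = 0
    i = 0  # extend to the right of the seed
    while q + w + i < len(query) and d + w + i < len(database):
        cur += match if query[q + w + i] == database[d + w + i] else mismatch
        i += 1
        if cur > best:
            best, best_right = cur, i
        elif best - cur > x_drop:
            break
    cur, i = best, 0  # extend to the left, starting from the best score so far
    while q - 1 - i >= 0 and d - 1 - i >= 0:
        cur += match if query[q - 1 - i] == database[d - 1 - i] else mismatch
        i += 1
        if cur > best:
            best, best_left = cur, i
        elif best - cur > x_drop:
            break
    return best, q - best_left, d - best_left, w + best_left + best_right


def e_value(score: float, m: int, n: int, k: float, lam: float) -> float:
    """Expected number of chance matches scoring >= score (m, n: query and database lengths)."""
    return k * m * n * math.exp(-lam * score)


query, database = "AGCTTTTC", "GGAGCTTTTCAA"
hits = find_hits(query, database)
print(hits)                                 # [(0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
print(extend_hit(query, database, 2, 4))    # (8, 0, 2, 8)
```

Here every seed lies on the same diagonal and extends to the same full-length match, which shows why practical implementations merge or skip seeds that fall inside an alignment already found.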
Knowledge and Search in BLAST BLAST differs from many of the informatics tools that we have considered in the course. Essentially, it finds a sequence’s nearest neighbors within a database with minimal concern for the content. Unlike discovery or analysis tools, BLAST gathers information and leaves the interpretation to the user. However, like many discovery tools, BLAST relies on domain knowledge to carry out heuristic search. Knowledge: match/mismatch costs for amino acid and nucleotide sequences. Heuristic search: an approximate scoring scheme that tells BLAST where to look more closely.
What Makes BLAST a Successful Tool? Google Scholar identifies over 28,000 citations of the original BLAST paper. One of the key reasons for the system’s popularity is that it addresses problems commonly encountered in biology: finding genetic homologues across organisms, and determining the source organism of a sequenced genome (e.g., the Global Ocean Sampling Expedition). Technical issues also contributed to BLAST’s success: it was much faster than competing software; it was distributed and maintained by the National Institutes of Health; and it has continually evolved to meet new challenges and to integrate with new databases and other technologies.
BLAST: Summary A key insight in BLAST was to refine a solution iteratively: first find a reduced set of short words to use as a heuristic for locating similar strings, then find matches to those short words and extend them to refine the candidate solution. This strategy accounts for the computational gains that the system makes over others that compare the query exhaustively against every sequence. The continued success of BLAST is attributable to the speed with which it can find sequence matches, its availability over the internet, its integration with other biological tools, and the fact that it addresses a specific need of biologists.