1
BLAST
2
What is BLAST? Basic Local Alignment Search Tool
Calculates similarity for biological sequences. Produces local alignments: only a portion of each sequence must be aligned. Uses statistical theory to determine if a match might have occurred by chance. Similarity is not homology: two sequences may be similar to some percentage, but they are either homologous or not. Local alignment can pick out active sites on proteins, which is important since most proteins are modular in nature. A global alignment does not take this into account, so similarities may be missed. The statistical theory is very important, as it tells us how likely it is that an alignment occurred just by chance.
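To make "local alignment" concrete, here is a minimal sketch of the exhaustive Smith-Waterman recurrence that computes only the best local score. The match, mismatch and gap values are illustrative assumptions (a simple linear gap penalty), not the scoring BLAST itself uses.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gaps).

    The scoring parameters are illustrative assumptions only.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start anywhere: this is what
            # makes the alignment local rather than global.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

Only the shared "GGG" segment contributes when `local_alignment_score("AAAGGGTTT", "CCGGGCC")` is evaluated; the unrelated flanking residues are simply left out of the alignment.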
3
BLAST is a heuristic. A lookup table is made of all the “words” (short subsequences) and “neighboring” words in the query sequence. The database is scanned for matching words (“hot spots”). Gapped and ungapped extensions are initiated from these matches. BLAST is about 100 times faster than exhaustive programs like Smith-Waterman.
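The lookup-and-scan step can be sketched as follows. This is a simplified illustration that indexes exact query words only (no neighboring words) and does not perform the extension step; the function names are hypothetical and not taken from any BLAST implementation.

```python
from collections import defaultdict


def build_word_index(query, w=3):
    """Map every length-w word in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_hot_spots(index, subject, w=3):
    """Scan a database sequence and return (query_pos, subject_pos) word hits."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


# Toy sequences, made up for illustration.
index = build_word_index("MKPQGLA")
hot_spots = find_hot_spots(index, "TTPQGAA")
```

Each hit would then serve as a seed for an extension; the speed-up comes from scanning the database with constant-time table lookups instead of filling a full dynamic-programming table for every sequence.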
4
Here the word is PQG and neighboring words are everything with a score of at least 13 (for three letters) as calculated by the given scoring system (e.g., BLOSUM62). PSG is a neighboring word (7 + 0 + 6 = 13 against PQG); PQA is not (7 + 5 + 0 = 12).
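The arithmetic behind this example can be checked with a short sketch. Only the five BLOSUM62 entries needed for PQG, PSG and PQA are included, and the helper names are hypothetical; a real search would use the full matrix.

```python
# Subset of BLOSUM62 covering only the residue pairs used in this example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "S"): 0,
    ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word, candidate):
    """Sum the position-by-position substitution scores of two words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def is_neighbor(query_word, candidate, threshold=13):
    """A candidate is a neighboring word if it scores at least the threshold."""
    return word_score(query_word, candidate) >= threshold
```

With a threshold of 13, `is_neighbor("PQG", "PSG")` holds while `is_neighbor("PQG", "PQA")` does not.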
5
BLAST reports at the NCBI Web page.
Here a protein sequence in FASTA format has been entered and the Swiss-Prot database selected on the protein-protein (blastp) page.
6
Formatting Page This page appears after the query on the previous page has been submitted and includes a Request Identifier (RID) as well as the results of a Conserved Domain Database (CDD) search. Clicking on “Format!” checks for the results.
7
Graphical Overview The graphical overview shows the database hits aligned underneath the query sequence (top red bar). Also on this slide is information about the query and the database searched as well as a link to TaxBlast.
8
One-line descriptions
The one-line descriptions consist of four fields: identifier (e.g., gi|116365|sp|P26374|RAE2_HUMAN), a (truncated) definition line, a bit score, and an expect value (the number of chance matches expected at that score).
9
Expect value The number of alignments expected by chance with a given score. The larger the database, the more alignments are expected at a given score. It can be thought of as “errors per query” or a “false positive rate”.
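The dependence on database size can be illustrated with the standard Karlin-Altschul relation between a bit score S' and the expect value, E = m · n · 2^(−S'), where m is the query length and n the total database length. The sketch below is a simplification: it assumes raw lengths and ignores the effective-length (edge-effect) corrections that BLAST applies, and the function name is hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplifying assumption: raw lengths are used, with no edge-effect
    correction.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Doubling the database size doubles the number of chance hits expected
# at the same score (the lengths here are made-up illustration values).
small_db = expect_value(40, 250, 1_000_000)
large_db = expect_value(40, 250, 2_000_000)
```

The same formula shows why each extra bit of score halves the expect value.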
10
Pair-wise alignments This is a pair-wise alignment view: part of the query is aligned to one database sequence. In this slide two alignments to the same database sequence are shown. Also shown are the full definition line and the database sequence length, as well as some statistics about the number of identical letters, etc.
11
BLOSUM62 matrix
13
Query-anchored alignments
This is the query-anchored view. The query sequence is shown aligned with all the database sequences that match that portion of the query. A dot (“.”) here indicates that the residue is conserved (identical to the query residue). Note the dinucleotide binding motif from bases (Koonin et al., Nature Genetics 12, page 237 (1996)), which consists of bulky hydrophobics and then a glycine-rich region. The query-anchored alignment makes it easy to spot other occurrences in the database.
14
Future improvements: LinkOut, taxonomic and structure links.
Link to taxonomy. Link to Locus-link. Link to UniGene. This format is still under discussion. The new version of the BLAST databases contains a flag for GI’s that have entries in Locus-link or UniGene as well as fielded taxonomic information. A flag will also soon be set if a GI has an associated structure that can be viewed with Cn3D, and an extra link may be added.
15
BLAST report designed for human readability.
One-line descriptions provide an overview designed for human “browsing”. Redundant information is presented in the report (e.g., one-line descriptions and alignments both contain expect values, scores, descriptions) so a user does not need to move back and forth between sections. The HTML version has lots of links for a user to explore. It can change as new features and information become available. The BLAST report has been changed or extended for algorithmic changes. PSI-BLAST needed to show on subsequent iterations which hits were new; for gapped BLAST we added the number of gaps in an alignment. We’ve also made changes at user request (e.g., hyperlinks for GI’s other than the first in an entry with redundant GI’s).
16
Hit-table Contains no sequence or definition lines, but does contain sequence identifiers, starts/stops (one-offset), percent identity of match as well as expect value etc. The simple format is ideal for automated tasks such as screening of sequences for contamination or sequence assembly. The lines starting with “#” should be considered comments and ignored. The last “#” line lists the fields in the table. The BLAST report contains more information than needed (sequence, definition lines) for most tasks that involve no manual inspection of results.
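A minimal parser for such a table might look like the sketch below. The column order is an assumption based on the common 12-column tab-separated layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, expect value, bit score); in practice the last “#” line of the actual output should be checked. The sample rows are made up for illustration.

```python
# Assumed column order; verify against the last "#" line of real output.
FIELDS = [
    "query_id", "subject_id", "percent_identity", "alignment_length",
    "mismatches", "gap_openings", "q_start", "q_end",
    "s_start", "s_end", "e_value", "bit_score",
]
FLOAT_FIELDS = {"percent_identity", "e_value", "bit_score"}
TEXT_FIELDS = {"query_id", "subject_id"}


def parse_hit_table(lines):
    """Parse tab-separated hit-table lines, skipping '#' comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        for key, value in hit.items():
            if key in FLOAT_FIELDS:
                hit[key] = float(value)
            elif key not in TEXT_FIELDS:
                hit[key] = int(value)
        hits.append(hit)
    return hits


# Made-up sample input, not real search results.
sample = [
    "# Fields: query id, subject id, % identity, ...",
    "query1\tsubjectA\t98.50\t200\t3\t0\t1\t200\t51\t250\t2e-30\t120.0",
    "query1\tsubjectB\t45.00\t80\t40\t2\t10\t89\t5\t84\t0.5\t30.2",
]
significant = [h for h in parse_hit_table(sample) if h["e_value"] < 1e-5]
```

Filtering on the expect value, as in the last line, is the kind of automated screening step the hit-table format is meant for.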
18
Using a filter (SEG) on a query.
19
What is an ALU? Constitutes about 5% of the human genome.
Short interspersed repeats. Found in primate genomes. ALU elements are often found in 3’ regions or introns.
20
Search showing ALU hits.
22
Identifying ALU contaminated regions.
ALU BLAST database on the NCBI Web page. RepeatMasker:
23
Removing ALU contamination
Human repeat filtering on BLAST Web pages. RepeatMasker:
24
PSI-BLAST A normal BLASTP (protein-protein) run is performed.
A position-dependent matrix is built using the most significant matches to the database. The search is rerun using this profile. The cycle may be repeated until convergence. The result is a ‘matrix’ tailored to the query.
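The matrix-building step can be sketched as below. This is a deliberately simplified illustration, not PSI-BLAST's actual procedure: it assumes ungapped, equal-length aligned segments, a uniform background frequency, and a flat pseudocount, whereas the real program uses sequence weighting and data-dependent pseudocounts. The function names are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_segments, pseudocount=1.0):
    """Build a position-dependent log-odds matrix from ungapped segments.

    Simplifying assumptions: uniform background, flat pseudocounts,
    no sequence weighting.
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(len(aligned_segments[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for segment in aligned_segments:
            counts[segment[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in ALPHABET}
        )
    return profile


def score_segment(profile, segment):
    """Score a candidate segment against the profile, column by column."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Toy aligned hits (made up): glycine is conserved in columns 0 and 2.
profile = build_profile(["GAG", "GSG", "GTG"])
```

In an iterative search, segments scoring well against this profile would be added to the alignment and the matrix rebuilt, until no new hits appear.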