1
Database Searching: BLAST and FASTA
2
Biological Sequence Databases
NCBI
Nucleotide: GenBank, EMBL, DDBJ, GSDB, patent sequences
Protein: SwissProt, PIR, PRF, PDB (structural)
PopSet
GDB, FlyBase, OMIM, HGMD, MGD, GED
There are other types of biological databases: EcoCyc, WIT/WIT2, and others
3
The total amount of non-redundant, well-maintained sequence data is 10-15 GB and growing exponentially
This does not include lots of other easily accessible data such as ESTs (expressed sequence tags)
4
In-class exercise: GenBank flatfile format
Open Netscape browser URL:
Make sure GenBank appears in the Search window; type in calmodulin; select Go
Pick any of the returned files; double-click to select it
Observe the output “flat file”
Remember: the database is not a collection of flatfiles!
You can do this for proteins too, to find the flatfile format for proteins
Locus: unique identifier = accession number, seq length, seq type, and some other fields that are there because of the history of the database
Definition: summarizes the biological significance of the record (single text line)
Accession number = primary key for the database; stays with the sequence, citable
gi number is the GenInfo identifier; it changes with changes to the sequence deposited in the DB
Lots of other info: keywords, source, organism, reference(s), features (information about the sequence, including the sequence itself)
Add looking at the tables of GenBank in the server?
5
Database searching: the problem
Experiments yield a sequence, and you want to know what kind of gene/protein it is (or you have a known gene sequence from one organism and you want to know if another organism has a homologous gene)
Databases have millions of sequences
We need a method to compare query sequences with all those in the databases: speed vs. accuracy
6
The real problem … How do we compare sequences?
Seq 1: CTGCACTA
Seq 2: CACTA or C---ACTA
Add scores
7
The real problem … How do we compare sequences?
Seq 1: CTGCACTA
Seq 2: CACTA or C---ACTA
Scoring tries to approximate evolution: scores for substitutions and for gaps (insertions/deletions)
Score = sum of terms for substitutions and for gaps (the sequence is treated as a character string)
8
Sequence alignment I
Simplest scoring: 1 for match, 0 for no match
CTGCACTA
   CACTA    Score = 5 (NOTE: we are not considering “overhangs” as gaps)
CTGCACTA
C---ACTA    Score = 5
9
Sequence alignment II
Slightly more advanced scoring: +1 for match, 0 for no match, -1 for gap
CTGCACTA
   CACTA    Score = 5 (again, the overhang, i.e. the end gap, is not penalized)
CTGCACTA
C---ACTA    Score = 2
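The two scoring schemes above can be checked with a short script. This is only a sketch: the function name `score_alignment` and the convention of writing the overhang as leading "-" characters are assumptions made for illustration, not part of the slides.

```python
def score_alignment(a, b, match=1, mismatch=0, gap=-1):
    """Score two equal-length aligned strings; '-' marks a gap.

    Gap columns at either end (overhangs) are trimmed and not penalized.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    cols = list(zip(a, b))
    # Drop leading and trailing columns that contain a gap character.
    while cols and "-" in cols[0]:
        cols.pop(0)
    while cols and "-" in cols[-1]:
        cols.pop()
    total = 0
    for x, y in cols:
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# Overhang alignment, then internal-gap alignment (+1 / 0 / -1 scheme).
print(score_alignment("CTGCACTA", "---CACTA"))
print(score_alignment("CTGCACTA", "C---ACTA"))
```

Setting `gap=0` reproduces the simplest scheme (1 for a match, 0 otherwise).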
10
[Figure: identity scoring matrices over G, C, A, T. Top: simple form; below: with mismatch penalty]
11
In-class exercise II
Using the “advanced scoring method”, calculate the scores for the following pairs of nucleotide sequences:
CCTGGGCTATGC
CAGGGTT-TGC
and
CCTGGGCTATGC
CA-GGG-TTTGC
For discussion afterwards: the score you get depends on whether you penalize end gaps. When we get to alignment algorithms, we will discuss this more.
12
Linear vs. affine gap penalties
Linear gap penalty: the same penalty is subtracted for each space in the gap
Affine gap penalty: the first space in the gap has a larger penalty than subsequent spaces in the gap; i.e., it is easier to lose or gain more subunits in an existing gap than to start a new gap/insertion (this makes sense evolutionarily)
13
Linear vs. Affine: example
Match = +1, mismatch = -1, gap = -2:
CCTGGGCTATGC
CC-GG-TT-TGC    Score = 1
Same scoring as above, but with an affine penalty (each additional space in a gap = -1):
CCTGGGCTATGC
CC--GGTT-TGC    Score = 2
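The linear and affine scores in this example can be reproduced with a small function. This is a minimal sketch; the name `affine_score` and the parameters `gap_open` and `gap_extend` are illustrative, and the reading that the first space of a gap costs -2 and each further space costs -1 is an assumption that happens to match the scores on the slide.

```python
def affine_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-2):
    """Score an alignment where the first space of a gap costs gap_open
    and every further space in the same gap costs gap_extend.

    With gap_extend == gap_open this is the linear gap penalty.
    Simplification: adjacent gaps in different sequences count as one run.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += match if x == y else mismatch
            in_gap = False
    return total


# Linear penalty: every space costs -2.
print(affine_score("CCTGGGCTATGC", "CC-GG-TT-TGC"))
# Affine penalty: first space -2, each additional space -1.
print(affine_score("CCTGGGCTATGC", "CC--GGTT-TGC", gap_extend=-1))
```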
14
What about proteins?
The chemistry of amino acids means that some substitutions in the sequence are better tolerated than others
Substitution matrix: scores, empirically derived from observed substitution frequencies, for replacing each amino acid with each of the 19 others.
We will talk more about how these substitution matrices are derived later in the course
15
BLOSUM 62 Substitution matrix
16
In-class exercise III
Using the BLOSUM62 substitution matrix and a gap penalty of -2, score the following pairs of protein sequences (do not penalize end gaps):
YIHMNVFLSFML RVGAANFPNPRL FIHMNLFVSFML IHMNLFV--SFML IVLSMMFFLNHY
These pairs of sequences have the following properties:
First box: unrelated sequences picked arbitrarily from unrelated proteins
Lower box, first column: gaps introduced
Upper box, second column: related, ungapped sequences
Last box: same sequence, randomized
17
In-class exercise IV
Using the BLOSUM62 substitution matrix, find the score of the following alignment using
A) a linear gap penalty of d = 8
B) an affine gap penalty of d = 8, e = 2
AVAHV---D--DMPNALS
AAIQLQVTGVVVTDATLK
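The gap part of this exercise can be separated from the substitution part. The sketch below computes only the total gap cost of the alignment; it assumes the convention that a gap of length g costs d + (g - 1)e under the affine scheme (some programs instead charge d + g·e), and the helper names are hypothetical. The BLOSUM62 substitution terms still have to be looked up in the matrix and added.

```python
import re


def gap_lengths(aligned):
    """Return the length of every run of '-' in an aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned)]


def linear_cost(lengths, d):
    """Linear penalty: every space in every gap costs d."""
    return sum(d * g for g in lengths)


def affine_cost(lengths, d, e):
    """Affine penalty (assumed convention): d for the first space,
    e for each additional space in the same gap."""
    return sum(d + (g - 1) * e for g in lengths)


top = "AVAHV---D--DMPNALS"
lengths = gap_lengths(top)
print(lengths)
print(linear_cost(lengths, 8))
print(affine_cost(lengths, 8, 2))
```

The alignment score is then the sum of the BLOSUM62 scores of the aligned residue pairs minus the gap cost computed here.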
18
Search algorithms
Early search algorithms scored the whole query sequence against every library sequence; very accurate but very slow
The FASTA and BLAST search algorithms use the idea that similar sequences are likely to share short regions of exact matches
19
Original BLAST Algorithm
Make a list of short “words” (w-mers) from the query sequence:
Choose all possible w-mers by sliding a window of length w down the query sequence
For nucleotide sequences, the w-mers are used “as is”
For protein sequences, the w-mers are all words that score higher than some threshold T against a query word, using a substitution matrix
This is a later algorithm than the FASTA algorithm; we are starting with this one because it is conceptually simpler.
20
W-mers: DNA vs. protein
DNA: CACTAGCTAAA
For w = 6: CACTAG, ACTAGC, etc.
Protein: WRKRKKRTGLE
For w = 3, T = 11: WRK, WRR, WKR, RKR, RKK, etc.
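Sliding a window of length w along the query can be sketched in one line of Python. The function name `wmers` is illustrative. Note that this only lists the words that occur in the query itself; for proteins, BLAST also adds neighboring words that score above T with the substitution matrix, which is not shown here.

```python
def wmers(seq, w):
    """Return every word of length w found by sliding a window along seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


print(wmers("CACTAGCTAAA", 6))   # DNA example, w = 6
print(wmers("WRKRKKRTGLE", 3))   # protein example, w = 3 (query words only)
```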
21
Original BLAST algorithm cont.
Scan the library for sequences that match the w-mers generated in step 1
Extend hits; any extension must increase the score over that of the original hit; extension stops when the score no longer increases
[Figure: a w-mer hit and its extension] This is illustrated in the movie at (supply address).
Report hit extensions (high-scoring segment pairs, HSPs) that exceed some threshold score S; the choice of S to include only “similar” sequences is derivable from Karlin-Altschul statistics (next week).
E is a parameter related to S; it is the number of HSPs scoring at least S that would be expected from random matches in a database of that size. Thus a small E-value means the returned hit is unlikely to be a random match, so it has a good probability of being significant.
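The scan-and-extend steps can be sketched for nucleotide sequences as follows. This is a simplified illustration, not the BLAST implementation: it assumes exact w-mer seeds and +1 per matching letter, so "extend while the score increases" reduces to "extend until the first mismatch". The real program tolerates a limited score drop-off during extension. The function names and the example sequences are hypothetical.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a w-mer matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend(query, subject, i, j, w):
    """Extend an exact seed in both directions while letters keep matching.

    Returns (query_start, subject_start, length); with +1 per match the
    score of the extended hit equals its length.
    """
    left = 0
    while (i - left - 1 >= 0 and j - left - 1 >= 0
           and query[i - left - 1] == subject[j - left - 1]):
        left += 1
    right = w
    while (i + right < len(query) and j + right < len(subject)
           and query[i + right] == subject[j + right]):
        right += 1
    return i - left, j - left, left + right


query, subject = "CTGCACTA", "GGCACTATT"
seeds = find_seeds(query, subject, 4)
print(seeds)
# Several seeds can extend to the same segment pair, so collect them in a set.
print({extend(query, subject, i, j, 4) for i, j in seeds})
```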
22
Original BLAST algorithm cont
BLAST results can give more than one area of sequence similarity per pair of proteins compared
BLAST results are more amenable to statistical analysis than FASTA results
Add comment about how BLAST works now; reiterate after going through FASTA
Read about the BLAST algorithm in the text and in the assigned journal paper
23
Gapped BLAST improvements
BLAST now requires two hits within distance A on the same diagonal to trigger extension (overlapping hits are ignored). To keep sensitivity, T is lowered, which increases the number of hits; but since computation time is mostly spent in the extension phase, the overall computation time is much reduced.
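The two-hit rule can be sketched by grouping word hits by diagonal (subject position minus query position) and looking for a second, non-overlapping hit close to the previous one. This is a minimal sketch under stated assumptions: the function name `two_hit_triggers` is hypothetical, "within distance A" is taken as a start-to-start distance of at most A, and the hit list is made up for illustration.

```python
from collections import defaultdict


def two_hit_triggers(hits, w, A):
    """Find pairs of word hits that would trigger an extension.

    hits: iterable of (query_pos, subject_pos) for word hits of length w.
    Returns (diagonal, first_query_pos, second_query_pos) tuples.
    """
    by_diag = defaultdict(list)
    for q, s in hits:
        by_diag[s - q].append(q)
    triggers = []
    for diag, positions in by_diag.items():
        positions.sort()
        last = None
        for q in positions:
            if last is not None:
                dist = q - last
                if dist < w:
                    continue  # overlaps the previous hit: ignored
                if dist <= A:
                    triggers.append((diag, last, q))
            last = q
    return triggers


hits = [(0, 5), (1, 6), (10, 15), (60, 65), (3, 20)]
print(two_hit_triggers(hits, w=3, A=40))
```

In this made-up example only the hits at query positions 0 and 10 on diagonal 5 trigger an extension: the hit at 1 overlaps the first one, the hit at 60 is too far away, and the hit on diagonal 17 has no partner.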
24
Gapped BLAST improvement, cont
Gapping is now allowed. An HSP that exceeds a moderate cutoff score triggers a dynamic programming algorithm (discussed in two weeks) to align the two sequences in question.
The statistical analysis is somewhat compromised, but still very good.
25
In-class exercise IV Open browser; go to www.ncbi.nlm.nih.gov
In the “search” window, use the pulldown menu to select proteins
Type cytochrome P450 in the “for” window; select Go
Select a protein by double-clicking
In the database flatfile window, where it says “Default View”, open the pulldown menu and select FASTA; then select the Display button
Add link info here.
26
In-class exercise IV, cont.
Paint (highlight) the sequence starting with the first amino acid; select Edit → Copy
From the top menu bar select Protein; in the next window select BLAST from the sidebar
In the BLAST window, under protein BLAST, select standard protein-protein BLAST
Paste the sequence into the “search” box; keep the defaults
Select BLAST
27
In-class exercise IV, cont.
In the next window, you will see information about the conserved domain (CD) search that is done by default; you should select Format first, then you can play with the CD results for a while. If you click on the CD image first, you can’t get back and have to start the process over.
The CD search uses profile searching, which we will discuss later in the course
The Format button will open a new window that will display the results when the search is over.
28
BLAST results
Bar graph with matches; list of matches with E values (more next time)
Matches with the smallest E values are the most similar to the query
29
FASTA algorithm
First, look for all identities between small “words” (ktups) from the query and every sequence in the database. The ktup size determines how many letters must be identical (e.g., 3)
Query: CTGCACTA, with ktups CTG, TGC, GCA, etc.
Library sequence: AGCTGACGCA, with matching ktups CTG and GCA
30
FASTA, cont.
[Figure: matrix of ktup matches between two sequences]
Ktup matches can be depicted in a matrix; diagonals indicate matches. For every library sequence, the 10 best diagonals are constructed from the ktup matches using a distance formula
Read paper again and add notes if necessary.
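The ktup lookup and the diagonal bookkeeping can be sketched as follows, using the two example sequences from the previous slide. This is a simplification: it ranks diagonals by a plain count of ktup matches, whereas FASTA scores diagonals with a formula that also accounts for the distance between matches. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def ktup_matches(query, library_seq, ktup):
    """Return (query_pos, library_pos) for every exact ktup match."""
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    matches = []
    for j in range(len(library_seq) - ktup + 1):
        for i in positions.get(library_seq[j:j + ktup], []):
            matches.append((i, j))
    return matches


def diagonal_counts(matches):
    """Count ktup matches per diagonal, where diagonal = query_pos - library_pos."""
    return Counter(i - j for i, j in matches)


query = "CTGCACTA"
library_seq = "AGCTGACGCA"
matches = ktup_matches(query, library_seq, 3)
print(matches)                                    # CTG and GCA match
print(diagonal_counts(matches).most_common(10))   # best 10 diagonals by count
```

Matches that fall on the same diagonal have the same offset between the two sequences, which is why a run of them indicates an ungapped region of similarity.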
31
FASTA, cont. The top 10 diagonals are rescored using substitution matrices and identities smaller than ktup; each of these rescored diagonals is called an initial region.
32
FASTA, cont. Initial regions are then joined using a joining penalty (like a gap penalty). The highest score of the joined initial regions is then the score for that library sequence. The library sequences are ranked by this score.
33
FASTA, cont. In the last step of the FASTA algorithm, library sequences that scored above some threshold value are aligned with the query sequence and each other using alignment methods (dynamic programming) we will discuss later.
34
In-class exercise V
Get into seqlab, open bioinfI.list
Select Functions → Database Reference Searching → Lookup
Select Search the chosen sequence libraries; select GenBank
In the all text box, type hsp70
Select Run; do NOT close the green output window
35
In-class exercise, cont
Making sure nothing in the main list is selected, toggle to the editor (blank screen)
Select File → Add Sequences From → Databases.
Paint (highlight) the name of the entry GB_BA1:BORHSP70 in the green output window with the left mouse button; click in the Database specification window with the left button, then click with the right button to paste.
Select Add to main window.
36
In-class exercise, cont.
Close the Database browser window and the green output window
Select the name of the sequence in the editor; select Functions → Database Sequence Searching → FASTA
Select Search set; then Add database sequences; then Primate under GenEMBL; then click on Add to search set; then close.
37
In-class exercise cont
If there is anything besides the primate library in the search set box, delete it; you can save the primate search set if you want to; then close the search set window. Select Run on the FASTA window; this will take at least 5 minutes
38
Interpreting FASTA results
FASTA results are reported in a histogram of expected values compared to a random search set. We will discuss this more next time.
The bottom part of the histogram contains the matches of interest (lowest probability of being random)
This is explained very well in the text: read it carefully. Also, pay close attention to the practical suggestions offered in the text for the different conditions for using FASTA vs. BLAST
39
Practical considerations in searches
FASTA: varying ktup
BLAST: varying S or E (w is optimized by default) in advanced BLAST
Secondary consideration: gap initiation and extension penalties (usual values are about 6 and 2, or 6 and 1)
Again, BLAST now allows gapped alignments. This means that its statistical advantage is not quite as clear, though it appears that the estimation of the statistical parameters is still very good.
When to use BLAST and when to use FASTA? Supposedly FASTA can be more sensitive on DNA. Many people do not believe this. In practice, if you are doing an exhaustive search for similar sequences, you should use both. If you are not, start with BLAST.
40
PSI-BLAST
PSI-BLAST (position-specific iterated BLAST) is an example of a search method that uses profiles, which we will discuss later in the course
PSI-BLAST starts off with a user-supplied query sequence, and a normal BLAST search is performed
41
If the user decides to iterate (e.g., if sequences with good E-values are returned in the first step), then the PSI-BLAST program aligns each returned sequence with the query sequence
The PSI-BLAST program stacks up, or aggregates, all these individual alignments, producing something that looks like a multiple alignment but really is not
42
From this stack of pairwise alignments, a scoring matrix is calculated by first finding the frequency of amino acids in each column, and then weighting that frequency to take into account very frequent and less frequent amino acids
Then the database is searched again, using this scoring matrix, which takes the variation at each position of the query sequence into account
This procedure can be repeated many times to expand the set of sequences retrieved by the original query sequence
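The idea of turning a stack of alignments into a position-specific scoring matrix can be sketched as below. This is a toy illustration, not the PSI-BLAST procedure itself: it uses a uniform background over a made-up four-letter alphabet, a simple pseudocount, and a log-odds score log2(column frequency / background frequency), whereas PSI-BLAST also weights the sequences and uses more elaborate pseudocounts. All names and the example rows are hypothetical.

```python
import math
from collections import Counter


def column_scores(stacked, background, pseudocount=1.0):
    """Build one score table per alignment column.

    stacked: equal-length aligned rows ('-' marks a gap).
    background: dict residue -> expected frequency.
    Returns a list of dicts mapping residue -> log2(freq / background).
    """
    alphabet = sorted(background)
    profile = []
    for col in zip(*stacked):
        counts = Counter(c for c in col if c != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background[r])
            for r in alphabet
        })
    return profile


rows = ["ACD", "ACE", "A-D"]                 # toy stack of aligned hits
background = {r: 0.25 for r in "ACDE"}       # toy uniform background
profile = column_scores(rows, background)
for pos, scores in enumerate(profile):
    print(pos, {r: round(s, 2) for r, s in scores.items()})
```

A residue that dominates a column gets a positive score at that position and rare residues get negative scores, so the next search round rewards database sequences that match the conserved positions.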
43
Caveats for PSI-BLAST
The results of a PSI-BLAST search can be skewed greatly by the sequences that are recruited into the initial set; it is a “greedy algorithm”
A different but related initial query may give different results (though there is likely to be overlap)
Homework question on PSI-BLAST
44
Other database search methods
SSEARCH: Based on dynamic programming algorithm; we will talk about this in two weeks. We will talk about PHI-BLAST, BLOCKS, PROBE, BAYES ALIGNER, and other profile-based methods later in the course