Published by Nigel Cook. Modified over 9 years ago.
1
Part 2: OUTLINE
• Introduction and motivation
• How does BLAST work?
• Query sequence: DNA or protein?
• Query type
• BLAST vs. PSI-BLAST
2
Why BLAST? Finding homologues
Homology: similarity between sequences that results from a common ancestor.
• Sequences that look alike probably have the same function and structure.
• Use a sequence as a search query in order to find homologous sequences in a database.
• Save time! Exploit the knowledge you have about your homologues to draw conclusions about your query.
• Rule of thumb: sequences with more than 25% identity (proteins) or 70% identity (nucleotides) will be considered homologous.
3
Why BLAST? Finding homologues
Answering basic questions such as:
• Which bacterial species have a protein that is related in lineage to a certain protein with a known amino-acid sequence?
• Where does a certain DNA sequence originate?
• What other genes encode proteins that exhibit structures or motifs such as the ones that have just been determined?
4
Why BLAST? Searching a sequence database
The idea: use your sequence as a query to find homologous sequences in a sequence database.
[Figure: a query (a sequence taken from Venter's trip) and a database]
5
Why BLAST? Searching a sequence database
[Figure: the query and the database]
6
Why BLAST? Searching a sequence database
[Figure: the query and a hit in the database]
8
Why BLAST? Why heuristics?
Assuming 10 comparisons per second, a full comparison of the query against a database of 10^7 sequences requires 11.5 days.
• 11.5 days is OK if we are doing it once.
• At least 150,000 searches are performed per day.
• There are more than 82,000,000 sequence records in GenBank.
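The arithmetic behind the 11.5-day figure can be checked with a short sketch. It uses only the slide's numbers (10^7 sequences, 10 comparisons per second, 150,000 searches per day); the function name is hypothetical.

```python
# Back-of-the-envelope cost of an exhaustive search, using the slide's numbers.
SECONDS_PER_DAY = 24 * 60 * 60


def exhaustive_search_days(num_sequences, comparisons_per_second):
    """Days needed to compare one query against every database sequence."""
    return num_sequences / comparisons_per_second / SECONDS_PER_DAY


days_per_query = exhaustive_search_days(10**7, 10)
print(f"{days_per_query:.2f} days per query")  # about 11.57, quoted as 11.5 on the slide

# With 150,000 searches submitted every day, this many machines would have to
# run in parallel just to keep up with the incoming queries.
machines_needed = days_per_query * 150_000
print(f"{machines_needed:,.0f} machines needed")
```

The second number is the reason a heuristic is needed: exhaustive comparison does not scale to the daily query load.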
9
Why BLAST? Terminology
• Query sequence: the sequence with which we are searching the database
• Hit: a sequence found in the database, suspected to be homologous to the query sequence
10
How does BLAST work? BLAST (Basic Local Alignment Search Tool)
Goal: a fast search for homologues in a huge database.
One of the most widespread bioinformatics programs:
• Provides a solution to a fundamental need.
• Emphasizes speed over sensitivity: the databases are enormous and will only grow larger.
• Cannot guarantee an optimal alignment: after finding the homologs via BLAST, an additional alignment program is needed.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) "Basic local alignment search tool" J. Mol. Biol. 215:
11
How does BLAST work? BLAST (Basic Local Alignment Search Tool)
The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them.
The heuristic:
• Discard irrelevant sequences.
• Perform exact local alignment only with the remaining sequences.
12
How does BLAST work? Searching a sequence database
Idea: to find sequences homologous to a sequence of interest, compute its pairwise alignment against all known sequences in a database and detect the best-scoring significant homologs.
• Query sequence: the sequence with which we are searching
• Hit: a sequence found in the database, suspected to be homologous
• HSP (high-scoring segment pair): the matched region
13
BLAST: main paradigm
Look for common words instead of trying all possible alignments between two sequences. If many common words are found, the query and the record are treated as possible homologues.
For each database record and the query:
• Are common words found between the record and the query?
• Yes: possible homologs. Save the record for further analysis.
• No: probably not homologs. Discard the record.
• Retrieve the next record from the database.
14
How does BLAST work? Searching a sequence database
Inputs:
• Query sequence
• Database of sequences
• Word size (use default…)
• Substitution matrix (use default…)
• Gap penalty (use default…)
15
How does BLAST work? The parameters
• W: word size. Find W-mers in the target/query; 2-3 for amino acids, 6-11 for nucleotides.
• T: threshold. Focus on word pairs scoring above T; usually 11-13.
• X: drop-off. Stop extending when the loss exceeds X.
• S: score. The final score of the segment pair.
16
How does BLAST work? How do we discard irrelevant sequences quickly?
• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA).
• Save the words in a look-up table that can be searched quickly.
Example: WTDFGYPAILKGGTAC is divided into WTD, TDF, DFG, FGY, GYP, …
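The word look-up table could be sketched as follows. This is a minimal illustration, not BLAST's actual data structure: the function names and the record identifier "rec1" are made up, and a plain dictionary stands in for the indexed table.

```python
from collections import defaultdict


def split_into_words(sequence, w):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def build_lookup_table(records, w):
    """Map each word to the (record id, position) pairs where it occurs."""
    table = defaultdict(list)
    for record_id, sequence in records.items():
        for position, word in enumerate(split_into_words(sequence, w)):
            table[word].append((record_id, position))
    return table


# The example sequence from the slide, with w = 3 as for proteins.
words = split_into_words("WTDFGYPAILKGGTAC", 3)
table = build_lookup_table({"rec1": "WTDFGYPAILKGGTAC"}, 3)

print(words[:5])     # ['WTD', 'TDF', 'DFG', 'FGY', 'GYP']
print(table["FGY"])  # [('rec1', 3)]
```

Looking up a query word is then a single dictionary access, which is what makes the discarding step fast.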
17
How does BLAST work? BLAST: discarding sequences
When the user enters a query sequence, it is also divided into words. For each word, neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) and a cutoff level T.
Example neighbor words with their scores: GFC (20), GFB, GPC (11), WAC (5)
18
How does BLAST work? BLAST: discarding sequences
A list of possible neighbor words is compiled; only exact matches between these words and words in the database are accepted. Words whose scores are greater than the threshold T remain in the list of possible matching words, while those with lower scores are discarded.
Example: GFC (20), GFB, GPC (11), WAC (5)
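Neighbor-word generation can be sketched as below. To stay short and self-contained, the sketch replaces BLOSUM62 with a toy scoring function (5 for identity, -1 otherwise), which is an assumption made purely for illustration. The scores therefore differ from those on the slide, and the cutoff is applied as "at least T".

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix such as BLOSUM62."""
    return 5 if a == b else -1


def word_score(word_a, word_b, score=toy_score):
    """Sum of the position-by-position scores of two equal-length words."""
    return sum(score(a, b) for a, b in zip(word_a, word_b))


def neighbor_words(query_word, threshold, score=toy_score):
    """Return every word scoring at least `threshold` against the query word."""
    neighbors = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        s = word_score(query_word, candidate, score)
        if s >= threshold:
            neighbors[candidate] = s
    return neighbors


neighbors = neighbor_words("GFC", threshold=9)
print(len(neighbors))        # 58: the word itself plus 3 * 19 single substitutions
print("GPC" in neighbors)    # True: one substitution scores 9
print("WAC" in neighbors)    # False: two substitutions score 3
```

With a real matrix the retained words are the conservative substitutions rather than all single mismatches, but the filtering logic is the same.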
19
How does BLAST work? The algorithm
• Align a query sequence with the database.
• Find "hits": short word pairs of length W with an ungapped alignment score of at least T.
• Extend the alignments until the score drops more than X below the best score so far. This step consumes most of the processing time (>90%).
20
How does BLAST work? Try to extend the alignment
• Stop extending when the score of the alignment drops X below the maximal score obtained so far.
• Discard segments with score < S.
Example (scores shown on the slide: 15, 17, 14):
ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS
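The ungapped extension with an X drop-off might look like the sketch below. It reuses the two sequences from the slide and seeds on the shared word LAA at position 9. The scoring function (+5 match, -4 mismatch) and the value X = 10 are assumptions, so the resulting score is not comparable to the scores shown on the slide.

```python
def toy_score(a, b):
    """Hypothetical scoring: +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def extend_one_direction(query, subject, q_pos, s_pos, step, x_drop, score=toy_score):
    """Extend without gaps; return (best gain, number of residues kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += score(query[q_pos], subject[s_pos])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score fell more than X below the best so far
        q_pos += step
        s_pos += step
    return best, best_len


def extend_hit(query, subject, q_start, s_start, w, x_drop, score=toy_score):
    """Extend a word hit in both directions; return (score, q_from, q_to)."""
    seed = sum(score(query[q_start + i], subject[s_start + i]) for i in range(w))
    right, right_len = extend_one_direction(
        query, subject, q_start + w, s_start + w, +1, x_drop, score)
    left, left_len = extend_one_direction(
        query, subject, q_start - 1, s_start - 1, -1, x_drop, score)
    return seed + left + right, q_start - left_len, q_start + w + right_len


query = "ASKIOPLLWLAASFLHNEQAPALSDAN"
subject = "JWQEOPLWPLAASOIHLFACNSIFYAS"
hsp_score, q_from, q_to = extend_hit(query, subject, 9, 9, w=3, x_drop=10)
print(hsp_score, query[q_from:q_to])  # 27 OPLLWLAAS
```

The extension tolerates the two mismatches at LW/WP because the running score recovers before the drop exceeds X; a final filter would then discard the segment if its score were below S.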
21
How does BLAST work? Two-hit gapped BLAST
The goal: a faster algorithm that reduces the number of extensions.
Observation: an HSP much longer than W often contains more than one word pair.
Idea: focus on two or more words on the same diagonal.
22
How does BLAST work? Look for a seed
Look for a seed: neighbor-word hits on the same diagonal that can be connected, i.e., at least 2 hits on the same diagonal at a distance smaller than a predetermined cutoff A.
This is the filtering stage: many unrelated hits are filtered out, saving lots of time!
[Figure: hits between the query and a database record]
23
How does BLAST work? Two-hit gapped BLAST
The new gapped BLAST algorithm:
1. Start with the two-hit method:
   (a) Find two hits with scores higher than T, within a distance A of each other.
   (b) Invoke an ungapped extension on the second hit.
2. If the generated HSP scores high enough:
   (a) Trigger a gapped extension.
   (b) If the final score has a significant E-value, report the gapped alignment.
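The two-hit filter can be sketched as follows. Hits are given as (query position, subject position) pairs; two hits lie on the same diagonal when the difference between subject and query position is equal. The function name, the handling of overlapping hits, and the example values (W = 3, A = 40) are assumptions for illustration.

```python
def find_two_hit_seeds(hits, word_len, max_distance):
    """Return pairs of non-overlapping hits on the same diagonal that lie
    less than `max_distance` apart (the cutoff A)."""
    last_hit_on_diagonal = {}
    seeds = []
    for q_pos, s_pos in sorted(hits):
        diagonal = s_pos - q_pos
        previous = last_hit_on_diagonal.get(diagonal)
        if previous is not None:
            distance = q_pos - previous[0]
            if distance < word_len:
                continue  # overlaps the previous hit; keep the earlier one
            if distance < max_distance:
                seeds.append((previous, (q_pos, s_pos)))
        last_hit_on_diagonal[diagonal] = (q_pos, s_pos)
    return seeds


hits = [(2, 10), (3, 11), (12, 20), (5, 30), (60, 68)]
seeds = find_two_hit_seeds(hits, word_len=3, max_distance=40)
print(seeds)  # [((2, 10), (12, 20))]
```

Only the seed pair would go on to the ungapped extension of the second hit; the lone hit on diagonal 25 and the distant hit at (60, 68) trigger no extension at all, which is where the time saving comes from.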
24
How does BLAST work? The result: local alignment
The result of BLAST is a series of local alignments between the query and the different hits found.
25
How does BLAST work? The scoring system
By default, BLAST uses BLOSUM62 as the scoring matrix to perform the alignment.
26
How does BLAST work? E-value
To assess the bit score we calculate an E-value.
E-value = the expected number of HSPs with a score of at least S that would occur by chance.
For each score S there is a specific E-value. The smaller the E-value, the better the score.
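The slide does not give the formula. For ungapped local alignments, the standard Karlin-Altschul estimate is E = K·m·n·e^(-λS), where m is the query length and n the database length. The sketch below assumes that formula. The values λ = 0.318 and K = 0.134 are illustrative assumptions, not parameters taken from this presentation.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance HSPs scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


LAMBDA, K = 0.318, 0.134  # illustrative values (assumption)
for s in (40, 60, 80):
    e = e_value(s, query_len=300, db_len=10**9, lam=LAMBDA, k=K)
    print(s, round(bit_score(s, LAMBDA, K), 1), f"{e:.2e}")
```

The E-value falls exponentially as the score grows, and it rises linearly with the size of the search space m·n, which is why the same score is less convincing in a larger database.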
27
How does BLAST work? E-value
Theoretically, we could trust any result with an E-value ≤ 1. In practice, BLAST uses estimations:
• E-values of 10^-4 and lower indicate a significant homology.
• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
• E-values between 10^-2 and 1 do not indicate a good homology.
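These rules of thumb translate directly into a small helper. The function name is hypothetical, and the label for E-values above 1 is an assumption, since the slide does not cover that range.

```python
def interpret_e_value(e_value):
    """Map an E-value to the rule-of-thumb categories from the slide."""
    if e_value <= 1e-4:
        return "significant homology"
    if e_value <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    if e_value <= 1:
        return "does not indicate a good homology"
    # Not covered by the slide; this label is an assumption.
    return "not trusted"


for e in (1e-30, 1e-3, 0.5, 7.0):
    print(e, "->", interpret_e_value(e))
```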
28
How does BLAST work? Low-complexity regions: filter
Low-complexity region: a region of a sequence that is composed of only a few kinds of elements. These regions might give high scores that mislead the program when it looks for the truly significant sequences in the database, so they should be filtered out with specialized programs.
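One simple way to flag such regions is to measure the Shannon entropy of a sliding window and mask windows that use very few distinct residues. This is only a sketch of the idea, not the method used by any specific filtering program; the window size of 12 and the cutoff of 2.2 bits are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in Counter(window).values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue that falls in a low-entropy window by mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


prefix, repeat, suffix = "ACDEFGHIKLMN", "P" * 16, "QRSTVWYACDEF"
masked = mask_low_complexity(prefix + repeat + suffix)
print(masked)
```

The proline run is masked completely, together with a few flanking residues whose windows are dominated by the repeat; a sequence with a varied composition passes through unchanged.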
29
Query sequence: DNA or protein?
For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable if we want to learn about homology?
30
Query sequence: DNA or protein?
Query type
• Nucleotides: a four-letter alphabet
• Amino acids: a twenty-letter alphabet
• Two random DNA sequences will, on average, have 25% identity.
• Two random protein sequences will, on average, have 5% identity.
31
Query sequence: DNA or protein?
Query types: which search is preferable?
1. The genetic code is redundant: some amino acids are coded by more than one codon. Therefore, the DNA sequence can change while the amino acid sequence remains the same.
2. Nucleotides: a four-letter alphabet. Amino acids: a twenty-letter alphabet.
3. Protein comparison matrices are much more sensitive than those for DNA, i.e., similarity relationships are defined between two amino acids (PAM/BLOSUM).
4. DNA databases are much larger, meaning more random hits.
32
Query sequence: DNA or protein?
Amino acids are better!
Selection (and hence conservation) works (mostly) at the protein level:
CTTTCA = Leu-Ser
TTGAGT = Leu-Ser
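The Leu-Ser example can be verified with a short sketch that translates both DNA strings using the standard genetic code and compares identity at each level. The helper names are hypothetical.

```python
# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1 into one-letter amino acid codes."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of identical positions between two equal-length strings."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


dna_1, dna_2 = "CTTTCA", "TTGAGT"
print(translate(dna_1), translate(dna_2))                        # LS LS
print(round(percent_identity(dna_1, dna_2), 1))                  # 16.7
print(percent_identity(translate(dna_1), translate(dna_2)))      # 100.0
```

The two DNA strings share only one of six positions, yet they encode the same dipeptide, which is the point of the slide.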
33
Query sequence: DNA or protein?
Amino acids are better!
• Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.
• Evolutionarily distant proteins will exhibit high similarity rather than high identity.
• Hits can exhibit a long alignment (homology) or a short alignment (conserved domains).
Why use a nucleotide sequence after all?
34
Query type
• The query can be a nucleotide sequence or an amino acid sequence. But … we can translate the query sequence!
• The search is performed against a nucleotide or amino acid database. But … we can use translated databases (e.g., TrEMBL)!
All types of searches are possible:
Query: DNA or protein
Database: DNA or protein
35
Query type
A nucleotide query can be translated and searched against protein databases:
• Translate all six reading frames (3 + 3).
• Find a long ORF.
Can an amino acid query be back-translated and searched against nucleotide databases?
• During translation we lose information.
• A single amino acid sequence can be back-translated to many possible nucleotide sequences.
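Six-frame translation (three frames on each strand) can be sketched as follows, using the standard genetic code. The frame labels and the "longest stretch between stop codons" definition of an ORF are simplifying assumptions; the input is assumed to contain only the letters A, C, G and T.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate in frame 1; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch between stop codons (a simplified ORF definition)."""
    return max(protein.split("*"), key=len)


frames = six_frame_translations("ATGCTTTCATAA")
print(frames["+1"], frames["-1"])  # MLS* L*KH
print(longest_orf(frames["+1"]))   # MLS
```

Going the other way is not possible in the same sense: because several codons map to one amino acid, a protein corresponds to many candidate DNA sequences.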
36
Query type
1. Amino acid query against a protein database (blastp)
   • Identifying a protein sequence.
   • Finding similar sequences in protein databases.
2. Nucleotide query against a nucleotide database (blastn)
   • In non-coding regions (no ORF found): identify the query sequence or find similar sequences.
   • Find primer binding sites or map short contiguous motifs.
3. Translated nucleotide query against a protein database (blastx)
   • Useful when the query includes a coding region and we try to find homologous proteins.
   • Used extensively in analyzing EST sequences.
   • More sensitive than nucleotide BLAST, since the comparison is performed at the protein level.
4. Protein query against a translated nucleotide database (tblastn)
   • Useful for finding protein homologs in unannotated nucleotide data of coding regions (e.g., ESTs, draft genome records (HTG)).
5. Translated nucleotide query against a translated nucleotide database (tblastx)
   • Useful for identifying novel genes in error-prone query sequences.
   • Used for identifying potential proteins encoded by single-pass read ESTs.
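The five combinations can be summarized as a small look-up. The program names are the standard NCBI ones (the translated-versus-translated search is called tblastx); the dictionary keys and the function name are hypothetical labels chosen for this sketch.

```python
# (how the query is searched, how the database is searched) -> program
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def choose_program(query_kind, database_kind):
    """Return the BLAST program for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_kind, database_kind)]
    except KeyError:
        raise ValueError(f"no BLAST program for {query_kind} vs {database_kind}")


print(choose_program("translated nucleotide", "protein"))  # blastx
```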
37
BLAST vs. PSI-BLAST: PSI-BLAST
Position-Specific Iterated BLAST
• Uses sequence information to build position-specific scoring matrices.
• More sensitive.
• After one standard BLAST iteration, the PSI-BLAST procedure is invoked for a number of additional iterations.
38
BLAST vs. PSI-BLAST: PSI-BLAST
Step 1:
• Run a standard protein-protein BLAST search (BLOSUM62).
• Build a position-specific scoring matrix (PSSM) from a multiple sequence alignment (MSA) of the results with low E-values.
Step 2:
• Run a BLAST search that uses the PSSM to evaluate the alignment (PSSM vs. database instead of sequence vs. database).
• Update the PSSM according to the new results.
• Go back to the beginning of step 2, or stop.
39
BLAST vs. PSI-BLAST: PSI-BLAST, searching with a profile
Aligning a profile matrix to a simple sequence is like aligning two sequences, except that the score for aligning a character with a matrix position is given by the matrix itself, not by a substitution matrix.
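A toy version of building a PSSM from aligned hits and scoring a sequence against it is sketched below. It is far simpler than what PSI-BLAST does: the uniform background frequency, the pseudocount of 1, the absence of sequence weighting, and the four-column example alignment are all assumptions for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / len(AMINO_ACIDS)  # uniform background frequency (assumption)


def build_pssm(aligned_sequences, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    pssm = []
    for column in range(len(aligned_sequences[0])):
        counts = Counter(s[column] for s in aligned_sequences if s[column] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_against_pssm(pssm, segment):
    """The score of each residue comes from its own column of the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_ungapped_match(pssm, sequence):
    """Slide the profile along the sequence; return (best score, offset)."""
    width = len(pssm)
    return max(
        (score_against_pssm(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


msa = ["GFCW", "GFCY", "GYCW", "GFCW"]  # hypothetical aligned hits
pssm = build_pssm(msa)
best_score, offset = best_ungapped_match(pssm, "AAGFCWAA")
print(round(best_score, 2), offset)
```

Because columns 1 and 3 are fully conserved in the example alignment, a G or C there contributes more than an F in the variable second column, which is how conserved positions end up weighted more heavily than in a fixed substitution matrix.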
40
BLAST vs. PSI-BLAST PSI-BLAST
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
41
BLAST vs. PSI-BLAST Testing PSI-BLAST
Compare sensitivity and speed of: • Smith-Waterman • Original BLAST • Gapped BLAST • PSI-BLAST
42
BLAST vs. PSI-BLAST: testing PSI-BLAST
• All but one are true homologs.
• PSI-BLAST is faster and more sensitive.
• The other BLAST algorithms perform well too.
43
BLAST vs. PSI-BLAST: the power of PSI-BLAST
• A much more sensitive scoring system: each position has its own pattern of probabilities.
• Different weights are given to conserved positions. Important motifs are bounded.
• Lowers the level of random noise.
• Finds distant relatives.
44
BLAST vs. PSI-BLAST: let's sum up…
• BLAST is a fast way to find homologues.
• There is no analytic theory that estimates the statistical significance of gapped alignments.
• Gap scores have been selected by trial and error while applying different scoring matrices.
• There is no guarantee for the gap scores.
• PSI-BLAST finds weak homologues fast.
45
Finding & selecting homologues
Where? (to find homologues)
• Structural templates: search against the PDB.
• Sequence homologues: search against Swiss-Prot, UniProt, or UniRef90 (recommended!).
How many?
• As many as possible, as long as the MSA looks good (examples in the next hour…).
46
Finding & selecting homologues
How long? (length of homologues)
• Fragments: short homologues (less than 50-60% of the query's length) = bad alignment.
• Ensure your sequences exhibit the wanted domain(s).
• N/C termini tend to vary in length between homologues.
• You can use HSPs or full sequences, depending on the case you are working on…
How close? (distance from the query sequence)
• All too close: no information.
• Too many too far: bad alignment.
• Ensure that you have a balanced collection!
47
Finding & selecting homologues
From whom? (which species the sequence belongs to)
• Don't care, all homologues are welcome.
• Orthologues/paralogues may be helpful.
• Sequences from distant/close species provide different types of information.
Which method? (BLAST/PSI-BLAST)
• Depends on the protein, the available homologues, the goal in mind…