Published by Lambert Little. Modified over 8 years ago.
1
Lab 3.2: Database Similarity Searching “The BLAST Buffet” Stephanie Minnema University of Calgary
2
Our Goal: Take a tour of NCBI BLAST Review practicalities of submitting BLAST queries Understand BLAST output Do sequence comparisons using basic and advanced BLAST methods
3
BLAST is Good For You
4
Database Similarity Searching The method you’ll use most! Scans a database for alignments to a query sequence Can get tons of information –functionality –evolutionary history –important residues Basis for many forms of bioinformatic analysis
5
Most Common Tool: BLAST: basic local alignment search tool –NCBI and others Based on fast local alignment methods –Global alignment computationally intensive –Global alignment not always biologically significant –Breaks query down into words (K-tuples) –Finds regions of similarity NCBI uses BLAST 2.0 (gapped BLAST) –Balances speed and sensitivity
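The "words" idea can be sketched in a few lines of Python. This is only an illustration of the seeding step: it finds exact shared k-tuples between two sequences, whereas real BLAST also scores neighborhood words and then extends the seeds. The function names and sequences are made up for this example.

```python
from collections import defaultdict

def words(seq, k):
    """Return every overlapping k-letter word in seq with its start position."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]

def shared_words(query, subject, k=3):
    """Find exact word matches (seeds) between query and subject.

    Returns a list of (word, query_pos, subject_pos) tuples.
    """
    # Index the subject once so each query word is a dictionary lookup.
    index = defaultdict(list)
    for word, pos in words(subject, k):
        index[word].append(pos)

    hits = []
    for word, qpos in words(query, k):
        for spos in index.get(word, []):
            hits.append((word, qpos, spos))
    return hits

# Hypothetical toy sequences, not real proteins.
print(shared_words("MKVLA", "GGKVLT", k=3))
```

Larger values of k give fewer seeds and a faster search at the cost of sensitivity, which is the speed/sensitivity trade-off the slide mentions.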
6
www.ncbi.nlm.nih.gov/BLAST/ Handy Information!
7
Mmmm, Many Flavors* * Chocolate Vanilla Swirl not available
8
Basic BLAST Flavors blastp: protein query vs. protein sequence database. blastn: nucleotide query vs. nucleotide sequence database. blastx: translated nucleotide query vs. protein sequence database tblastn: protein query vs. translated nucleotide sequence database tblastx: translated nucleotide query vs. translated nucleotide sequence database.
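The flavor list above is essentially a lookup table from (query type, database type) to program. A minimal sketch of that table follows; the alphabet check is a rough heuristic assumed for illustration (a protein made only of A, C, G, T and N would be misclassified), and the helper names are hypothetical.

```python
# (query type, database type) -> programs, following the list above.
BLAST_PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}

NUCLEOTIDE_LETTERS = set("ACGTUN")

def guess_type(seq):
    """Guess 'nucleotide' or 'protein' from the letters used (heuristic only)."""
    letters = set(seq.upper())
    return "nucleotide" if letters <= NUCLEOTIDE_LETTERS else "protein"

def candidate_programs(query_seq, db_type):
    """List the BLAST flavors that accept this query against this database type."""
    return BLAST_PROGRAMS[(guess_type(query_seq), db_type)]

# Hypothetical query sequences.
print(candidate_programs("MKVLAAGIV", "nucleotide"))
print(candidate_programs("ATGCCGTAA", "nucleotide"))
```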
9
What program will best suit your query, and desired output? Protein comparisons give most meaningful results Sequence complexity: 20 aa vs. 4 nt. Moderately similar nucleotide sequences could encode a highly similar protein sequence! What’s Your Favorite Flavor?
10
Takeout Message #1… Compare sequences on the protein level unless you know your query does not encode a protein product
11
Using Basic BLAST Methods Example: MASH-1 protein sequence from mouse Can I find similar proteins in Human?
13
Links to Information
14
Input QueryChoose Database
15
Submitting Your Query Input query sequence –FASTA –Raw –Accession/ ID Choose Database –Many available; varies with program –For complete list follow the link to: http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases
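To make the FASTA and raw input formats concrete, here is a small parser sketch. It assumes the usual convention of a ">" definition line followed by sequence lines, and treats input with no definition line as a raw sequence. The record names are invented for the example.

```python
def parse_fasta(text):
    """Parse FASTA (or raw) text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                header = ""  # raw sequence with no definition line
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical records for illustration.
example = ">seq1 hypothetical protein\nMKV\nLAA\n>seq2\nGGT\n"
print(parse_fasta(example))
```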
16
Finds Conserved Domains Limit results with entrez query E-Value cut off
17
Submitting Your Query CD Search –Finds conserved domains in query sequence –Compares to patterns and profiles of CDs Limit by Entrez query –Restricts results to single organism etc. E-value cutoff –Restricts results to ones falling below defined E-value –Default = 10 –Will revisit concept of E-value
18
Filtering Matrix Gap Penalties
19
Submitting Your Query Low complexity filtering –Low complexity sequence can lead to spurious alignments –Filtering “hides” these regions –On by default –SEG (proteins) or DUST (nucleic acids) –Should turn it off in some cases… what if your entire sequence gets filtered?
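A rough way to see what filtering does is to mask windows whose letter composition is too uniform. The sketch below uses Shannon entropy over a sliding window. It is not the SEG or DUST algorithm, and the window size and threshold are arbitrary values chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits) of the letter composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue inside a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

def fraction_masked(masked, mask_char="X"):
    """Fraction of the sequence that was hidden by the filter."""
    return masked.count(mask_char) / len(masked) if masked else 0.0

# Hypothetical query: a poly-Q run is masked completely.
masked = mask_low_complexity("Q" * 20)
print(masked, fraction_masked(masked))
```

A `fraction_masked` value near 1.0 is the "entire sequence gets filtered" case from the slide, and a hint to rerun the search with filtering turned off.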
20
Submitting Your Query Choice of scoring matrix –Different ones available –BLOSUM matrices based on observed frequencies of a.a. substitutions –Each tailored to different levels of sequence divergence and length –BLOSUM 62 = default –Shown to be best at detecting most protein similarities… don’t usually need to change –Follow link for detailed information
21
Submitting Your Query Gap Penalties –Accounts for insertions and deletions in different sequences –Scores are penalized for gaps to prevent aberrant alignments –Opening penalty is high; extension penalty is lower –Defaults may change depending on matrix choice –Rarely need to change default value
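The effect of a high opening penalty and a lower extension penalty can be checked by scoring an existing alignment. In this sketch the match/mismatch values are a toy stand-in for a real substitution matrix, and the gap costs (11 to open, 1 per gapped position) are assumed here as the commonly quoted blastp defaults. Check the BLAST form for the values actually in use.

```python
def raw_score(aligned_query, aligned_subject, match=4, mismatch=-1,
              gap_open=11, gap_extend=1):
    """Score a pairwise alignment (gaps written as '-') with affine gap costs.

    A gap of length k costs gap_open + k * gap_extend.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")

    score = 0
    open_row = None  # which row currently has an open gap: "q", "s" or None
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            row = "q" if q == "-" else "s"
            if open_row != row:
                score -= gap_open   # charged once when a gap starts
            score -= gap_extend     # charged for every gapped position
            open_row = row
        else:
            score += match if q == s else mismatch
            open_row = None
    return score

# Same residues aligned, but one long gap versus two short gaps.
print(raw_score("AC--GT", "ACTTGT"))
print(raw_score("A-C-GT", "ATCTGT"))
```

One gap of length two scores better than two gaps of length one, which is how affine penalties discourage alignments peppered with many small gaps.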
23
Take note Click for more info
24
Formatting Options
26
Understanding Your Results Graphic representation of results –Top of graph represents query sequence –Underlying bars show where hits occur –Colors represent alignment scores –Grey areas represent non similar regions surrounded by similar regions –Scrolling over bar shows accession and description of hit –Clicking on a bar takes you to its alignment with the query
27
Bit Scores E-values
28
Understanding Your Results Bit scores –Normalized raw score –Raw score = sum of substitution scores and gap penalties –Normalized on basis of scoring method –Can compare searches scored using different matrices –Higher is better, but a bit score alone doesn’t adequately represent the significance of an alignment
29
Understanding Your Results E-values –Indicator of alignment significance –Expected number of alignments with an equal or better score that would arise by chance in a search of this database –Lower is better –E-values decrease exponentially as scores for an alignment increase
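The link between raw scores, bit scores and E-values can be sketched with the standard Karlin-Altschul relations: bit score S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), where m and n are the query and database lengths. The λ and K defaults below are illustrative values assumed for gapped BLOSUM62 searches. BLAST reports the actual parameters at the bottom of its output, and it also uses effective rather than raw lengths.

```python
import math

def bit_score(raw, lam=0.267, k=0.041):
    """Convert a raw alignment score to a bit score (lam and k are assumed values)."""
    return (lam * raw - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)

# Hypothetical search: 250-residue query against a 100-million-residue database.
for raw in (40, 60, 100):
    bits = bit_score(raw)
    print(raw, round(bits, 1), e_value(bits, 250, 1e8))
```

Each additional bit halves the E-value, which is the exponential decrease noted on the slide. The same bit score gives a larger E-value against a larger database.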
30
Examine Results
31
Understanding Your Results Alignments –Important to inspect them –Take note of percent identity and similarity between query and aligned sequence –Examine regions of similarity and gaps –What if a sub-optimal alignment is the most functionally significant one?
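Percent identity and similarity are easy to recompute by hand when inspecting an alignment. In the sketch below, "similar" is defined by a toy grouping of amino acids assumed for illustration. BLAST itself counts a pair as a positive when its substitution matrix score is greater than zero.

```python
# Toy conservative-substitution groups (an assumption, not BLAST's definition).
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "NQ", "ST", "AG"]

def is_similar(a, b):
    """True if two different residues fall in the same toy group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)

def alignment_stats(aligned_query, aligned_subject):
    """Return length, gap count, percent identity and percent similarity."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("need two non-empty aligned sequences of equal length")

    identical = similar = gaps = 0
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            identical += 1
            similar += 1      # identical pairs also count as similar
        elif is_similar(q, s):
            similar += 1

    length = len(aligned_query)
    return {
        "length": length,
        "gaps": gaps,
        "identity_pct": 100.0 * identical / length,
        "similarity_pct": 100.0 * similar / length,
    }

# Hypothetical alignment fragment.
print(alignment_stats("MKVL-A", "MRILGA"))
```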
32
Takeout Message #2… Don’t trust your computer blindly: Examine and think about your results
33
Homology: Some Rules to Consider Similarity can be indicative of homology Generally, if two sequences are significantly similar over entire length they are likely homologous 50% similarity over a short sequence often occurs by chance Low complexity regions can be highly similar without being homologous Homologous sequences not always highly similar
34
Takeout Message #3… Homology is like pregnancy
35
Basic BLAST Flavors for Special Occasions BLAST 2 Sequences (bl2seq) –Aligns two sequences of your choice –Can do different types of comparison, e.g. blastx –Gives dot-plot like output VecScreen –Compares query with sequences of known cloning vectors Both very handy for sequencing!
36
Basic BLAST Flavors for Special Occasions BLAST against genomes –Many available –BLAST parameters pre-optimized –Handy for mapping query to genome Search for short exact matches –BLAST parameters pre-optimized –Great for checking probes and primers
37
Basic BLAST Flavors for Special Occasions megaBLAST –For aligning sequences which differ slightly due to sequencing errors etc. –Very efficient for long query sequences –Uses big word (k-tuple) sizes to start search Very fast –Accepts batch submissions of ESTs –Can upload files of sequences as queries More detailed info: see megaBLAST pages
38
Time to Sample the Buffet… Try questions 1 – 4, found at the end of the lab notes accompanying this lecture. We’ll discuss them in 15 - 20 minutes
39
Advanced BLAST Methods The NCBI BLAST pages have several advanced BLAST methods available –PSI-BLAST –PHI-BLAST –RPS-BLAST All are powerful methods based on protein similarities
40
More Complex Flavor: PSI-BLAST Position Specific Iterated – BLAST A cycling/iterative method –Gives increased sensitivity for detecting distantly related proteins –Can give insight into functional relationships –Very refined statistical methods Fast – still based on BLAST methods Simple to use
41
PSI-BLAST Principle
1. First, a standard blastp is performed
2. The highest scoring hits are used to generate a multiple alignment
3. A PSSM is generated from the multiple alignment
–Highly conserved residues get high scores
–Less conserved residues get lower scores
4. Another similarity search is performed, this time using the new PSSM
5. Steps 2-4 can be repeated until convergence
–No new sequences appear after iteration
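Step 3 can be sketched as a log-odds calculation per alignment column. This is a simplified stand-in: it assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST weights sequences and uses more refined pseudocounts. The alignment is invented for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length aligned sequences ('-' = gap).

    Returns one dict per column mapping residue -> score in bits.
    """
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score_with_pssm(pssm, segment):
    """Score an ungapped segment of the same width as the PSSM."""
    return sum(pssm[i][aa] for i, aa in enumerate(segment))

# Hypothetical alignment: columns 1 and 4 are fully conserved.
pssm = build_pssm(["GHIG", "GHLG", "GKVG"])
print(round(pssm[0]["G"], 2), round(pssm[2]["I"], 2), round(pssm[0]["A"], 2))
```

The fully conserved G scores higher than any single residue in the variable third column, and residues never seen in a column score below zero. This position-specific scoring is what lets the next search pass pick up more distant relatives.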
42
Example: Aminoacyl tRNA Synthetases 20 enzymes for 20 amino acids Each is very different –Big, small, monomers, tetramers, strange globs… All bind to their appropriate tRNAs, with high specificity –Bind all for their amino acid, but none of the others TrpRS and TyrRS share only 13% sequence identity BUT, overall structures of TrpRS and TyrRS are similar Structure Function relationship
43
Same SCOP family based on catalytic domain Overall structure similarity noted
44
TyrRS and TrpRS are Similar Sequence similarity expected, right? BUT: blastp of E. coli TyrRS against bacterial sequences in SwissProt does not show similarity with TrpRS –E-value cutoff of 10
45
No TrpRS!?
46
Try Using PSI-BLAST… PSI-BLAST available from BLAST main page Query form just like for blastp –BUT: one extra formatting option must be used –“Format for PSI-BLAST” – check it off! –Second e-value cutoff used to determine which alignments will be used for PSSM build… “Threshold for inclusion” First search using TyrRS as query –Db = SwissProt; limit = Bacteria [ORGN] –Threshold for inclusion = 0.005
49
After A Few Iterations…
50
TyrRS Similarity to TrpRS!
51
Power of PSI-BLAST We knew TyrRS and TrpRS were similar –Functionally and structurally Blastp gave no indication PSI-BLAST was able to detect their weak sequence similarity A word of caution… be sure to inspect and think about the results included in the PSSM build. –Include/exclude sequences on basis of biological knowledge
52
Query Results Does the query really have a relationship with the results?
53
Takeout Message #4… Use your biological knowledge when doing PSI-BLAST to yield the most significant results
54
Another Complex Flavor: PHI-BLAST Pattern Hit Initiated – BLAST PHI-BLAST principle: –Same method as PSI-BLAST –Starts first search with query sequence + pattern for a motif in the query PHI-BLAST finds sequences containing the motif and having significant sequence similarity in the vicinity of the motif occurrence –Highly specific
55
Example: TyrRS TyrRS contains the aaRS class-I signature Want to find sequences containing that motif, and regional similarity to TyrRS First: get the Prosite pattern for the class-I signature –Prosite = db of protein families and domains
56
http://ca.expasy.org/prosite aminoacyl-transfer RNA synthetase
57
P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]
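To see where a Prosite pattern occurs in a sequence before pasting it into PHI-BLAST, it can be translated into a regular expression. The converter below is a sketch covering only part of the Prosite syntax (single residues, x, [allowed], {forbidden} and repeat counts). The test sequence is made up to contain the motif and is not a real TyrRS fragment.

```python
import re

def prosite_to_regex(pattern):
    """Translate a Prosite pattern into a Python regex (subset of the syntax)."""
    pattern = pattern.replace(" ", "").rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(0,2)" into its core and optional repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, repeat = match.groups()

        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        elif not ((len(core) == 1 and core.isalpha())
                  or (core.startswith("[") and core.endswith("]"))):
            raise ValueError(f"unsupported element: {element!r}")

        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

CLASS_I = ("P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-"
           "[HNTG]-[LIVMFYSTAGPC]")
regex = prosite_to_regex(CLASS_I)
print(regex)

# Hypothetical sequence constructed to contain the motif.
hit = re.search(regex, "MKPGDALHLGHLAA")
print(hit.group(), hit.start())
```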
58
Insert Query Sequence Insert PHI Pattern
59
PHI-BLAST Results After first search, PHI-BLAST functions same as PSI-BLAST Result page is the same Can iterate in same way. Try it later if you like…
60
The Key to PHI- and PSI-BLAST Generating the multiple alignments to create PSSMs –Refines scoring in searches Annotated collections of multiple alignments defining domains exist –Conserved domain database (CDD) –Contains 18039 alignments (10013 last year) Can search the CDD using CD search –Uses RPS-BLAST
61
RPS-BLAST Reverse Position Specific – BLAST –Opposite of PSI-BLAST CDD multiple alignments converted to PSSMs PSSMs are processed and turned into a searchable database Queries are searched against PSSMs using RPS-BLAST Output indicates conserved domains within the query sequence
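The "reverse" direction can be sketched as scanning one query against a small library of PSSMs and reporting which ones score well. This brute-force scan only illustrates the idea. RPS-BLAST uses BLAST-style heuristics and statistics, and the two toy "domains", their scores and the cutoff are all invented for the example.

```python
def best_window(query, pssm, default=-2.0):
    """Slide an ungapped PSSM along the query; return (best_score, start)."""
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues not listed in a column get the default (negative) score.
        score = sum(col.get(aa, default) for col, aa in zip(pssm, window))
        if score > best[0]:
            best = (score, start)
    return best

def search_domains(query, domain_db, min_score):
    """Return (domain_name, start, score) for every PSSM scoring >= min_score."""
    hits = []
    for name, pssm in domain_db.items():
        score, start = best_window(query, pssm)
        if score >= min_score:
            hits.append((name, start, score))
    return sorted(hits, key=lambda hit: -hit[2])

# Hypothetical PSSM library: each PSSM is a list of per-column score dicts.
TOY_DB = {
    "toy_cys_pair": [{"C": 5}, {"A": 1, "G": 1}, {"C": 5}],
    "toy_basic_patch": [{"K": 3, "R": 3}] * 3,
}

print(search_domains("MACGCLLKRKA", TOY_DB, min_score=8))
```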
62
Example: CRADD protein
63
Click on picture to see CDD multiple alignment Click to see alignment with query cDART
64
Summary of Advanced BLAST Methods
PSI-BLAST
–Input: SEQUENCE
–Database: SEQUENCES
–Algorithm: Constructs a PSSM from an initial pass and uses this in the next pass
–Output: Distantly related sequences
–+sensitive, -specific
PHI-BLAST
–Input: PROFILE + SEQUENCE
–Database: SEQUENCES
–Algorithm: Same as PSI-BLAST except start with a profile
–Output: Sequences containing the domain and that are similar in the domain region
–+sensitive, -> -specific
RPS-BLAST
–Input: SEQUENCE
–Database: DOMAINS
–Output: Domains found in the sequence
–+sensitive, +specific
65
Back for Another Helping… Try the remaining questions in the notes!
66
Enlightenment begins with a BLAST Special Thanks to Sohrab Shah for the aaRS example and further BLAST enlightenment