Lab 3.2: Database Similarity Searching “The BLAST Buffet” Stephanie Minnema University of Calgary.

Our Goal:
Take a tour of NCBI BLAST
Review practicalities of submitting BLAST queries
Understand BLAST output
Do sequence comparisons using basic and advanced BLAST methods

BLAST is Good For You

Database Similarity Searching
The method you’ll use most!
Scans a database for alignments to a query sequence
Can get tons of information
  – functionality
  – evolutionary history
  – important residues
Basis for many forms of bioinformatic analysis

Most Common Tool: BLAST
BLAST: Basic Local Alignment Search Tool
  – NCBI and others
Based on fast local alignment methods
  – Global alignment is computationally intensive
  – Global alignment is not always biologically significant
  – Breaks query down into words (k-tuples)
  – Finds regions of similarity
NCBI uses BLAST 2.0 (gapped BLAST)
  – Balances speed and sensitivity
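To make the "words (k-tuples)" idea concrete, here is a minimal Python sketch that lists the overlapping words of a query. The function name `query_words`, the word size k = 3, and the short peptide are illustrative assumptions, not part of the lecture; real BLAST also expands each word into a neighborhood of similar words before scanning the database.

```python
def query_words(query: str, k: int = 3) -> list[str]:
    """Return every overlapping word (k-tuple) of length k in the query."""
    # A query of length n has n - k + 1 overlapping words (none if n < k).
    return [query[i:i + k] for i in range(len(query) - k + 1)]


# Hypothetical 5-residue peptide, used only for illustration.
print(query_words("MKLVT"))  # ['MKL', 'KLV', 'LVT']
```

Each word acts as a seed: a database sequence only needs to be examined further where one of these short words (or a close neighbor) occurs.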

www.ncbi.nlm.nih.gov/BLAST/ Handy Information!

Mmmm, Many Flavors* * Chocolate Vanilla Swirl not available

Basic BLAST Flavors
blastp: protein query vs. protein sequence database
blastn: nucleotide query vs. nucleotide sequence database
blastx: translated nucleotide query vs. protein sequence database
tblastn: protein query vs. translated nucleotide sequence database
tblastx: translated nucleotide query vs. translated nucleotide sequence database
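The five flavors above amount to a small lookup on query type and database type. A minimal sketch in Python, assuming hypothetical names (`choose_program`, `translate_both`) chosen only for this illustration:

```python
# Mapping taken from the slide: (query type, database type) -> program.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type: str, db_type: str, translate_both: bool = False) -> str:
    """Pick a basic BLAST flavor for the given query and database types."""
    if translate_both:
        # tblastx translates both the nucleotide query and the nucleotide database.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_program("nucleotide", "protein"))  # blastx
```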

What’s Your Favorite Flavor?
What program will best suit your query and desired output?
Protein comparisons give the most meaningful results
Sequence complexity: 20 amino acids vs. 4 nucleotides
Moderately similar nucleotide sequences could encode a highly similar protein sequence!

Takeout Message #1… Compare sequences on the protein level unless you know your query does not encode a protein product

Using Basic BLAST Methods
Example: MASH-1 protein sequence from mouse
Can I find similar proteins in human?

Links to Information

Input Query; Choose Database

Submitting Your Query
Input query sequence
  – FASTA
  – Raw
  – Accession/ID
Choose Database
  – Many available; varies with program
  – For complete list follow the link to: http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases
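For readers unfamiliar with the FASTA input option: a record is a header line starting with ">" followed by one or more lines of sequence. The sketch below parses that layout in Python. The function name `parse_fasta` and the example record are hypothetical (the sequence is a made-up placeholder, not MASH-1).

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Placeholder record for illustration only.
example = """>example_protein hypothetical description
MKLVTAAG
HHWQ
"""
print(parse_fasta(example))
```

A "raw" query is simply the sequence lines without the header.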

Finds Conserved Domains; Limit Results with Entrez Query; E-Value Cutoff

Submitting Your Query
CD Search
  – Finds conserved domains in query sequence
  – Compares to patterns and profiles of CDs
Limit by Entrez query
  – Restricts results to a single organism, etc.
E-value cutoff
  – Restricts results to ones falling below the defined E-value
  – Default = 10
  – Will revisit concept of E-value

Filtering; Matrix; Gap Penalties

Submitting Your Query
Low complexity filtering
  – Low complexity sequence can lead to spurious alignments
  – Filtering “hides” these regions
  – On by default
  – SEG (proteins) or DUST (nucleic acids)
  – Should turn it off in some cases… what if your entire sequence gets filtered?
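The effect of filtering can be illustrated with a simple sliding-window entropy mask. This is a simplified stand-in, not the SEG or DUST algorithm; the window size, the threshold, and the function names are all assumptions made for this sketch.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2,
                        mask_char: str = "X") -> str:
    """Replace residues in low-entropy windows with mask_char (assumed parameters)."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# A homopolymer run is masked entirely; a varied stretch is left alone.
print(mask_low_complexity("AAAAAAAAAAAA"))
print(mask_low_complexity("ACDEFGHIKLMN"))
```

The first call shows the situation the slide warns about: a query made only of low-complexity sequence is masked completely, leaving nothing to align.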

Submitting Your Query
Choice of scoring matrix
  – Different ones available
  – BLOSUM matrices based on observed frequencies of amino acid substitutions
  – Each tailored to different levels of sequence divergence and length
  – BLOSUM 62 = default
  – Shown to be best at detecting most protein similarities… don’t usually need to change
  – Follow link for detailed information

Submitting Your Query
Gap Penalties
  – Account for insertions and deletions in different sequences
  – Scores are penalized for gaps to prevent aberrant alignments
  – Opening penalty is high; extension penalty is lower
  – Defaults may change depending on matrix choice
  – Rarely need to change default value
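The "high opening, lower extension" scheme is usually called an affine gap penalty. A minimal sketch, assuming the convention that a gap of length k costs the opening penalty plus k times the extension penalty; the default values 11 and 1 are used here only as an example and, as the slide notes, the defaults depend on the matrix.

```python
def gap_penalty(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Affine gap cost: one opening charge plus an extension charge per gapped position."""
    if length <= 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * length


# One long gap is much cheaper than several short ones of the same total length.
print(gap_penalty(5))      # 16
print(5 * gap_penalty(1))  # 60
```

This is why affine penalties favor a few contiguous insertions or deletions over many scattered single-residue gaps.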

Take note; Click for more info

Formatting Options

Understanding Your Results
Graphic representation of results
  – Top of graph represents query sequence
  – Underlying bars show where hits occur
  – Colors represent alignment scores
  – Grey areas represent non-similar regions surrounded by similar regions
  – Scrolling over a bar shows accession and description of hit
  – Clicking on a bar takes you to its alignment with the query

Bit Scores; E-values

Understanding Your Results
Bit scores
  – Normalized raw score
  – Raw score = sum of substitution scores and gap penalties
  – Normalized on basis of scoring method
  – Can compare searches scored using different matrices
  – Higher is better, but bit scores alone don’t adequately represent significance of alignment
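The normalization mentioned above is commonly written as S' = (λS − ln K) / ln 2, where S is the raw score and λ and K are statistical parameters of the scoring system. A short sketch follows; the λ and K values in the usage lines are placeholders assumed for illustration, since the real values depend on the matrix and gap penalties used.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to a bit score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Placeholder parameters (assumed); a higher raw score gives a higher bit score.
for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, lam=0.267, k=0.041), 1))
```

Because λ and K absorb the properties of the scoring system, bit scores from searches run with different matrices can be compared directly.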

Understanding Your Results
E-values
  – Indicator of alignment significance
  – Expected number of alignments with an equal or better score that could have arisen by chance
  – Lower is better
  – E-values decrease exponentially as scores for an alignment increase
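The exponential relationship can be sketched with the commonly used approximation E = m · n · 2^(−S'), where S' is the bit score, m the query length, and n the total database length. The lengths below are made-up illustrative numbers, and the sketch ignores the effective-length corrections that real BLAST applies.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: search space size times 2 to the minus bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search space: 250-residue query, 50 million residues of database.
for bits in (30, 40, 50):
    print(bits, expect_value(bits, 250, 50_000_000))
```

Each additional bit of score halves the E-value, and the same score becomes less significant as the database grows.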

Examine Results

Understanding Your Results
Alignments
  – Important to inspect them
  – Take note of percent identity and similarity between query and aligned sequence
  – Examine regions of similarity and gaps
  – What if a sub-optimal alignment is the most functionally significant one?
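Percent identity can be checked by hand from any pairwise alignment. A minimal sketch, assuming gaps are written as "-" and that identity is reported over the full alignment length including gapped columns; the function name and the toy alignment are illustrative. Percent similarity would additionally need a substitution matrix to decide which mismatches count as conservative, so it is left out here.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percentage of alignment columns where both sequences have the same residue."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = sum(
        1 for a, b in zip(aligned_query, aligned_subject) if a == b and a != "-"
    )
    return 100.0 * identical / len(aligned_query)


# Toy alignment: 3 identical columns out of 4.
print(percent_identity("AC-E", "ACDE"))  # 75.0
```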

Takeout Message #2… Don’t trust your computer blindly: Examine and think about your results

Homology: Some Rules to Consider
Similarity can be indicative of homology
Generally, if two sequences are significantly similar over their entire length, they are likely homologous
50% similarity over a short sequence often occurs by chance
Low complexity regions can be highly similar without being homologous
Homologous sequences are not always highly similar

Takeout Message #3… Homology is like pregnancy

Basic BLAST Flavors for Special Occasions
BLAST 2 Sequences (bl2seq)
  – Aligns two sequences of your choice
  – Can do different types of comparison, e.g. blastx
  – Gives dot-plot-like output
VecScreen
  – Compares query with sequences of known cloning vectors
Both very handy for sequencing!

Basic BLAST Flavors for Special Occasions
BLAST against genomes
  – Many available
  – BLAST parameters pre-optimized
  – Handy for mapping query to genome
Search for short exact matches
  – BLAST parameters pre-optimized
  – Great for checking probes and primers

Basic BLAST Flavors for Special Occasions
megaBLAST
  – For aligning sequences which differ slightly due to sequencing errors, etc.
  – Very efficient for long query sequences
  – Uses big word (k-tuple) sizes to start search
Very fast
  – Accepts batch submissions of ESTs
  – Can upload files of sequences as queries
More detailed info: see megaBLAST pages

Time to Sample the Buffet… Try questions 1 – 4, found at the end of the lab notes accompanying this lecture. We’ll discuss them in 15 - 20 minutes

Advanced BLAST Methods
The NCBI BLAST pages have several advanced BLAST methods available
  – PSI-BLAST
  – PHI-BLAST
  – RPS-BLAST
All are powerful methods based on protein similarities

More Complex Flavor: PSI-BLAST
Position Specific Iterated BLAST
A cycling/iterative method
  – Gives increased sensitivity for detecting distantly related proteins
  – Can give insight into functional relationships
  – Very refined statistical methods
Fast: still based on BLAST methods
Simple to use

PSI-BLAST Principle
1. First, a standard blastp is performed
2. The highest scoring hits are used to generate a multiple alignment
3. A PSSM (position-specific scoring matrix) is generated from the multiple alignment
  – Highly conserved residues get high scores
  – Less conserved residues get lower scores
4. Another similarity search is performed, this time using the new PSSM
5. Steps 2-4 can be repeated until convergence
  – No new sequences appear after iteration
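Step 3 can be sketched as a column-by-column log-odds calculation. The version below is a deliberately simplified illustration: it assumes an ungapped toy alignment, a uniform background frequency, and a fixed pseudocount, whereas PSI-BLAST itself uses sequence weighting and more refined statistics. All names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build a log-odds PSSM (one dict per column) from equal-length aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm: list[dict[str, float]], segment: str) -> float:
    """Score an ungapped segment against the PSSM, position by position."""
    return sum(col[res] for col, res in zip(pssm, segment))


# Toy alignment: column 1 fully conserved (G), column 2 mostly H, column 3 variable.
pssm = build_pssm(["GHA", "GHC", "GKD"])
print(round(pssm[0]["G"], 2), round(pssm[1]["H"], 2), round(pssm[2]["A"], 2))
```

The conserved G scores highest, the mostly conserved H lower, and residues in the variable column lower still, which is the behavior the slide describes. Step 4 then scores database sequences against these position-specific values instead of a general matrix such as BLOSUM 62.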

Example: Aminoacyl tRNA Synthetases
20 enzymes for 20 amino acids
Each is very different
  – Big, small, monomers, tetramers, strange globs…
All bind to their appropriate tRNAs, with high specificity
  – Bind all for their amino acid, but none of the others
TrpRS and TyrRS share only 13% sequence identity
BUT, overall structures of TrpRS and TyrRS are similar
Structure-function relationship

Same SCOP family based on catalytic domain Overall structure similarity noted

TyrRS and TrpRS are Similar
Sequence similarity expected, right?
BUT: blastp of E. coli TyrRS against bacterial sequences in SwissProt does not show similarity with TrpRS
  – E-value cutoff of 10

No TrpRS!?

Try Using PSI-BLAST…
PSI-BLAST available from BLAST main page
Query form just like for blastp
  – BUT: one extra formatting option must be used
  – “Format for PSI-BLAST”: check it off!
  – Second E-value cutoff used to determine which alignments will be used for PSSM build: “Threshold for inclusion”
First search using TyrRS as query
  – Db = SwissProt; limit = Bacteria [ORGN]
  – Threshold for inclusion = 0.005

After A Few Iterations…

TyrRS Similarity to TrpRS!

Power of PSI-BLAST
We knew TyrRS and TrpRS were similar
  – Functionally and structurally
Blastp gave no indication
PSI-BLAST was able to detect their weak sequence similarity
A word of caution… be sure to inspect and think about the results included in the PSSM build.
  – Include/exclude sequences on the basis of biological knowledge

Query Results Does the query really have a relationship with the results?

Takeout Message #4… Use your biological knowledge when doing PSI-BLAST to yield the most significant results

Another Complex Flavor: PHI-BLAST
Pattern Hit Initiated BLAST
PHI-BLAST principle:
  – Same method as PSI-BLAST
  – Starts first search with query sequence + pattern for a motif in the query
PHI-BLAST finds sequences containing the motif and having significant sequence similarity in the vicinity of the motif occurrence
  – Highly specific

Example: TyrRS
TyrRS contains the aaRS class-I signature
Want to find sequences containing that motif, and regional similarity to TyrRS
First: get the Prosite pattern for the class-I signature
  – Prosite = database of protein families and domains

http://ca.expasy.org/prosite aminoacyl-transfer RNA synthetase

P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]
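For readers who want to read the pattern syntax: elements are separated by "-", "x" means any residue, square brackets list allowed residues, curly braces list forbidden residues, and a number or range in parentheses repeats the preceding element. The sketch below translates that syntax into a Python regular expression to show what the pattern matches. The converter `prosite_to_regex` and the test peptide are hypothetical illustrations; PHI-BLAST takes the pattern directly and does not need this step.

```python
import re

# One PROSITE element: optional '<' anchor, a core, optional repeat, optional '>' anchor.
ELEMENT = re.compile(
    r"^(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)$"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into an equivalent regular expression string."""
    regex = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high is not None else "") + "}"
        regex.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(regex)


CLASS_I = "P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]"
regex = prosite_to_regex(CLASS_I)
print(regex)

# Made-up peptide constructed to satisfy the pattern; not a real TyrRS fragment.
print(bool(re.search(regex, "MKPAGDALHLGHLWW")))
```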

Insert Query Sequence; Insert PHI Pattern

PHI-BLAST Results
After first search, PHI-BLAST functions the same as PSI-BLAST
Result page is the same
Can iterate in the same way. Try it later if you like…

The Key to PHI- and PSI-BLAST
Generating the multiple alignments to create PSSMs
  – Refines scoring in searches
Annotated collections of multiple alignments defining domains exist
  – Conserved domain database (CDD)
  – Contains 18039 alignments (10013 last year)
Can search the CDD using CD search
  – Uses RPS-BLAST

RPS-BLAST
Reverse Position Specific BLAST
  – Opposite of PSI-BLAST
CDD multiple alignments converted to PSSMs
PSSMs are processed and turned into a searchable database
Queries are searched against PSSMs using RPS-BLAST
Output indicates conserved domains within the query sequence

Example: CRADD protein

Click on picture to see CDD multiple alignment; Click to see alignment with query; cDART

Summary of Advanced BLAST Methods
PSI-BLAST
  – Input: SEQUENCE
  – Database: SEQUENCES
  – Algorithm: Constructs a PSSM from an initial pass and uses this in the next pass
  – Output: Distantly related sequences
  – +sensitive, -specific
PHI-BLAST
  – Input: PROFILE + SEQUENCE
  – Database: SEQUENCES
  – Algorithm: Same as PSI-BLAST except start with a profile
  – Output: Sequences containing the domain and that are similar in the domain region
  – +sensitive, -> -specific
RPS-BLAST
  – Input: SEQUENCE
  – Database: DOMAINS
  – Output: Domains found in the sequence
  – +sensitive, +specific

Back for Another Helping… Try the remaining questions in the notes!

Enlightenment begins with a BLAST
Special Thanks to Sohrab Shah for the aaRS example and further BLAST enlightenment
