Lab 3.2: Database Similarity Searching “The BLAST Buffet” Stephanie Minnema University of Calgary.


Our Goal: Take a tour of NCBI BLAST Review practicalities of submitting BLAST queries Understand BLAST output Do sequence comparisons using basic and advanced BLAST methods

BLAST is Good For You

Database Similarity Searching The method you’ll use most! Scans a database for alignments to a query sequence Can get tons of information –functionality –evolutionary history –important residues Basis for many forms of bioinformatic analysis

Most Common Tool: BLAST: basic local alignment search tool –NCBI and others Based on fast local alignment methods –Global alignment computationally intensive –Global alignment not always biologically significant –Breaks query down into words (K-tuples) –Finds regions of similarity NCBI uses BLAST 2.0 (gapped BLAST) –Balances speed and sensitivity
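A minimal sketch of the "breaks query down into words" step, assuming a word size of 3 (a common choice for protein searches). The function name `query_words` and the example sequence are made up for illustration; real BLAST also expands each word into a neighbourhood of similar words, which is not shown here.

```python
def query_words(query, word_size=3):
    """Return every overlapping word (k-tuple) of length word_size in query."""
    # A query of length L yields L - word_size + 1 overlapping words.
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


if __name__ == "__main__":
    # Hypothetical five-residue query, for illustration only.
    print(query_words("MATWL"))  # ['MAT', 'ATW', 'TWL']
```

Each word is then used as a seed to look for short matches in the database, which is what makes the search fast compared with a full alignment against every entry.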

Handy Information!

Mmmm, Many Flavors* * Chocolate Vanilla Swirl not available

Basic BLAST Flavors blastp: protein query vs. protein sequence database. blastn: nucleotide query vs. nucleotide sequence database. blastx: translated nucleotide query vs. protein sequence database tblastn: protein query vs. translated nucleotide sequence database tblastx: translated nucleotide query vs. translated nucleotide sequence database.

What program will best suit your query, and desired output? Protein comparisons give most meaningful results Sequence complexity: 20 aa vs. 4 nt. Moderately similar nucleotide sequences could encode a highly similar protein sequence! What’s Your Favorite Flavor?
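To see why protein-level comparison can be more informative, here is a small sketch with two made-up DNA fragments that differ at most positions yet encode the same peptide. Only the six codons needed for the example are included in the table (all from the standard genetic code); the sequences and function names are illustrative assumptions.

```python
# Partial standard genetic code: just the codons used in this example.
CODONS = {
    "TTA": "L", "CTG": "L",  # leucine
    "AGC": "S", "TCT": "S",  # serine
    "CGT": "R", "AGA": "R",  # arginine
}


def translate(dna):
    """Translate a DNA string in frame 1 using the partial table above."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    dna_a = "TTAAGCCGT"  # hypothetical fragment
    dna_b = "CTGTCTAGA"  # hypothetical fragment using synonymous codons
    print("nucleotide identity:", identity(dna_a, dna_b))
    print("protein identity:", identity(translate(dna_a), translate(dna_b)))
```

The nucleotide identity here is low while the encoded peptides are identical, which is the situation the slide warns about.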

Takeout Message #1… Compare sequences on the protein level unless you know your query does not encode a protein product

Using Basic BLAST Methods Example: MASH-1 protein sequence from mouse Can I find similar proteins in Human?

Links to Information

Input Query; Choose Database

Submitting Your Query Input query sequence –FASTA –Raw –Accession/ ID Choose Database –Many available; varies with program –For complete list follow the link to:
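For readers unfamiliar with the FASTA input format mentioned above: each record is a header line starting with ">" followed by one or more sequence lines. The sketch below is a simplified reader, not a full parser, and the record shown is a made-up placeholder rather than a real database entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]        # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    example = ">example_protein hypothetical description\nMATWLPQ\nGHKLV\n"
    print(parse_fasta(example))
```

A "raw" query is simply the sequence lines without the header.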

Finds Conserved Domains Limit results with entrez query E-Value cut off

Submitting Your Query CD Search –Finds conserved domains in query sequence –Compares to patterns and profiles of CDs Limit by Entrez query –Restricts results to single organism etc. E-value cut off –Restricts results to ones falling below defined e-value –Default = 10 –Will revisit concept of e-value

Filtering Matrix Gap Penalties

Submitting Your Query Low complexity filtering –Low complexity sequence can lead to spurious alignments –Filtering “hides” these regions –On by default –SEG (proteins) or DUST (nucleic acids) –Should turn it off in some cases… what if your entire sequence gets filtered?
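The idea of "hiding" low-complexity regions can be sketched with a sliding-window Shannon entropy filter. This is not the SEG or DUST algorithm, only a simplified stand-in; the window size, entropy threshold, and function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace residues in low-entropy windows with 'X' (illustrative only)."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("ACDEFGHIKLMN"))  # diverse: left unchanged
    print(mask_low_complexity("SSSSSSSSSSSS"))  # homopolymer: fully masked
```

The second call also illustrates the slide's warning: a query made up entirely of low-complexity sequence is masked completely, leaving nothing to search with.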

Submitting Your Query Choice of scoring matrix –Different ones available –BLOSUM matrices based on observed frequencies of a.a. substitutions –Each tailored to different levels of sequence divergence and length –BLOSUM 62 = default –Shown to be best at detecting most protein similarities… don’t usually need to change –Follow link for detailed information

Submitting Your Query Gap Penalties –Accounts for insertions and deletions in different sequences –Scores are penalized for gaps to prevent aberrant alignments –Opening penalty is high; extension penalty is lower –Defaults may change depending on matrix choice –Rarely need to change default value
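A short sketch of the "high opening, lower extension" scheme (an affine gap penalty). The values 11 and 1 are assumed here as the usual protein defaults paired with BLOSUM62, with the convention that a gap of length k costs the opening penalty plus k times the extension penalty; check the search form for the values actually in use.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Affine gap cost: a one-time opening cost plus a per-residue extension cost."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # One gap of three residues is much cheaper than three separate gaps,
    # which is why alignments are not riddled with many tiny gaps.
    print(gap_penalty(3))      # 14
    print(3 * gap_penalty(1))  # 36
```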

Take note Click for more info

Formatting Options

Understanding Your Results Graphic representation of results –Top of graph represents query sequence –Underlying bars show where hits occur –Colors represent alignment scores –Grey areas represent non similar regions surrounded by similar regions –Scrolling over bar shows accession and description of hit –Clicking on a bar takes you to its alignment with the query

Bit Scores E-values

Understanding Your Results Bit scores –Normalized raw score –Raw score = sum of substitution scores and gap penalties –Normalized on basis of scoring method –Can compare searches scored using different matrices –Higher is better, but don’t adequately represent significance of alignment
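The normalization can be written as S' = (lambda * S - ln K) / ln 2, where S is the raw score and lambda and K are statistical parameters that depend on the scoring system. The sketch below assumes lambda = 0.267 and K = 0.041, values often quoted for gapped BLOSUM62 searches; treat them as illustrative, since BLAST reports the actual values at the bottom of its output.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # Assumed parameters for illustration (not taken from the slides).
    print(round(bit_score(100, lam=0.267, k=0.041), 1))
```

Because lambda and K absorb the details of the matrix and gap penalties, bit scores from searches run with different scoring systems can be compared directly.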

Understanding Your Results E-values –Indicator of alignment significance –Expected number of alignments with a score at least this high that would arise by chance in a search of a database of this size –Lower is better –E-values decrease exponentially as scores for an alignment increase
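The exponential relationship can be sketched as E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total length of the database. This is a simplified form: BLAST uses "effective" lengths that correct for edge effects, and the lengths below are made-up numbers.

```python
def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Hypothetical search space: 250-residue query, 50 million residue database.
    for bits in (30, 40, 50, 60):
        print(bits, e_value(bits, 250, 50_000_000))
```

Each extra bit of score halves the E-value, and the same score becomes less significant as the database grows.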

Examine Results

Understanding Your Results Alignments –Important to inspect them –Take note of percent identity and similarity between query and aligned sequence –Examine regions of similarity and gaps –What if a sub-optimal alignment is the most functionally significant one?
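When inspecting an alignment, the identity and gap counts can be recomputed by hand. The sketch below assumes two aligned strings of equal length with "-" marking gaps, and reports identity over the full alignment length; the sequences are invented for the example. Similarity ("positives") additionally needs a substitution matrix and is not computed here.

```python
def alignment_stats(aligned_query, aligned_subject):
    """Count identities and gap columns in a pairwise alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(aligned_query, aligned_subject))
    identities = sum(1 for q, s in columns if q == s and q != "-")
    gaps = sum(1 for q, s in columns if q == "-" or s == "-")
    return {
        "length": len(columns),
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / len(columns),
    }


if __name__ == "__main__":
    # Hypothetical alignment with one gap and one mismatch.
    print(alignment_stats("MAT-WL", "MATPWV"))
```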

Takeout Message #2… Don’t trust your computer blindly: Examine and think about your results

Homology: Some Rules to Consider Similarity can be indicative of homology Generally, if two sequences are significantly similar over entire length they are likely homologous 50% similarity over a short sequence often occurs by chance Low complexity regions can be highly similar without being homologous Homologous sequences not always highly similar

Takeout Message #3… Homology is like pregnancy

Basic BLAST Flavors for Special Occasions BLAST 2 Sequences (bl2seq) –Aligns two sequences of your choice –Can do different types of comparison, e.g. blastx –Gives dot-plot like output VecScreen –Compares query with sequences of known cloning vectors Both very handy for sequencing!

Basic BLAST Flavors for Special Occasions BLAST against genomes –Many available –BLAST parameters pre-optimized –Handy for mapping query to genome Search for short exact matches –BLAST parameters pre-optimized –Great for checking probes and primers

Basic BLAST Flavors for Special Occasions megaBLAST –For aligning sequences which differ slightly due to sequencing errors etc. –Very efficient for long query sequences –Uses big word (k-tuple) sizes to start search Very fast –Accepts batch submissions of ESTs –Can upload files of sequences as queries More detailed info: see megaBLAST pages

Time to Sample the Buffet… Try questions 1 – 4, found at the end of the lab notes accompanying this lecture. We’ll discuss them in minutes

Advanced BLAST Methods The NCBI BLAST pages have several advanced BLAST methods available –PSI-BLAST –PHI-BLAST –RPS-BLAST All are powerful methods based on protein similarities

More Complex Flavor: PSI-BLAST Position Specific Iterated – BLAST A cycling/iterative method –Gives increased sensitivity for detecting distantly related proteins –Can give insight into functional relationships –Very refined statistical methods Fast – still based on BLAST methods Simple to use

PSI-BLAST Principle 1.First, a standard blastp is performed 2.The highest scoring hits are used to generate a multiple alignment 3.A PSSM is generated from the multiple alignment. –Highly conserved residues get high scores –Less conserved residues get lower scores 4.Another similarity search is performed, this time using the new PSSM 5.Steps 2-4 can be repeated until convergence –No new sequences appear after iteration
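Step 3 can be sketched as follows: count residues in each column of the multiple alignment, add pseudocounts, and take the log-odds against a background frequency. This is a deliberately simplified illustration assuming a uniform background and a fixed pseudocount; PSI-BLAST itself uses sequence weighting and more refined statistics. The toy alignment and all names are made up.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds PSSM (one dict of scores per column) from aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Score an ungapped sequence of the same length against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    toy_alignment = ["GHW", "GHY", "GKW", "GHW"]  # hypothetical hits
    pssm = build_pssm(toy_alignment)
    print(round(pssm[0]["G"], 2), round(pssm[1]["K"], 2))
    print(score_sequence(pssm, "GHW") > score_sequence(pssm, "AKY"))
```

The fully conserved first column scores G more highly than the partly conserved second column scores H, matching the slide's point that conservation drives the scores used in the next iteration.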

Example: Aminoacyl tRNA Synthetases 20 enzymes for 20 amino acids Each is very different –Big, small, monomers, tetramers, strange globs… All bind to their appropriate tRNAs, with high specificity –Bind all for their amino acid, but none of the others TrpRS and TyrRS share only 13% sequence identity BUT, overall structures of TrpRS and TyrRS are similar Structure-function relationship

Same SCOP family based on catalytic domain Overall structure similarity noted

TyrRS and TrpRS are Similar Sequence similarity expected, right? BUT: blastp of E. coli TyrRS against bacterial sequences in SwissProt does not show similarity with TrpRS –e-value cutoff of 10

No TrpRS!?

Try Using PSI-BLAST… PSI-BLAST available from BLAST main page Query form just like for blastp –BUT: one extra formatting option must be used –“Format for PSI-BLAST” – check it off! –Second e-value cutoff used to determine which alignments will be used for PSSM build… “Threshold for inclusion” First search using TyrRS as query –Db = SwissProt; limit = Bacteria [ORGN] –Threshold for inclusion = 0.005

After A Few Iterations…

TyrRS Similarity to TrpRS!

Power of PSI-BLAST We knew TyrRS and TrpRS were similar –Functionally and structurally Blastp gave no indication PSI-BLAST was able to detect their weak sequence similarity A word of caution… be sure to inspect and think about the results included in the PSSM build. –Include/exclude sequences on basis of biological knowledge

Query Results Does the query really have a relationship with the results?

Takeout Message #4… Use your biological knowledge when doing PSI-BLAST to yield the most significant results

Another Complex Flavour: PHI-BLAST Pattern Hit Initiated – BLAST PHI-BLAST principle: –Same method as PSI-BLAST –Starts first search with query sequence + pattern for a motif in the query PHI-BLAST finds sequences containing the motif and having significant sequence similarity in the vicinity of the motif occurrence –Highly specific

Example: TyrRS TyrRS contains the aaRS class-I signature Want to find sequences containing that motif, and regional similarity to TyrRS First: get the Prosite pattern for the class-I signature –Prosite = db of protein families and domains

aminoacyl-transfer RNA synthetase

P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]
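To read a pattern like the one above: elements are separated by "-", "x" means any residue, square brackets list the residues allowed at a position, and "(0,2)" is a repeat range. The sketch below converts this common subset of the PROSITE syntax into a Python regular expression. The converter is a simplified assumption-laden helper (it does not cover every PROSITE feature), and the test peptide is invented.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    pattern = pattern.replace(" ", "").rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # {ABC} means "any residue except A, B, C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase x is the wildcard; < and > anchor the sequence ends.
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


if __name__ == "__main__":
    class_i = ("P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-"
               "[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]")
    regex = prosite_to_regex(class_i)
    print(regex)
    # Hypothetical peptide built to contain the motif.
    print(bool(re.search(regex, "MKPGDALHLGHLVV")))
```

PHI-BLAST takes the pattern in its PROSITE form; the regex view is only meant to make the pattern's meaning concrete.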

Insert Query Sequence Insert PHI Pattern

PHI-BLAST Results After first search, PHI-BLAST functions same as PSI-BLAST Result page is the same Can iterate in same way. Try it later if you like…

The Key to PHI- and PSI-BLAST Generating the multiple alignments to create PSSMs –Refines scoring in searches Annotated collections of multiple alignments defining domains exist –Conserved domain database (CDD) –Contains alignments (10013 last year) Can search the CDD using CD search –Uses RPS-BLAST

RPS-BLAST Reverse Position Specific – BLAST –Opposite of PSI-BLAST CDD multiple alignments converted to PSSMs PSSMs are processed and turned into a searchable database Queries are searched against PSSMs using RPS-BLAST Output indicates conserved domains within the query sequence

Example: CRADD protein

Click on picture to see CDD multiple alignment Click to see alignment with query cDART

Summary of Advanced BLAST Methods
PSI-BLAST –Input: SEQUENCE –Database: SEQUENCES –Algorithm: Constructs a PSSM from an initial pass and uses this in the next pass –Output: Distantly related sequences –+sensitive, -specific
PHI-BLAST –Input: PROFILE + SEQUENCE –Database: SEQUENCES –Algorithm: Same as PSI-BLAST except start with a profile –Output: Sequences containing the domain and that are similar in the domain region –+sensitive, -> -specific
RPS-BLAST –Input: SEQUENCE –Database: DOMAINS –Output: Domains found in the sequence –+sensitive, +specific

Back for Another Helping… Try the remaining questions in the notes!

Enlightenment begins with a BLAST Special Thanks to Sohrab Shah for the aaRS example and further BLAST enlightenment