Http://creativecommons.org/licenses/by-sa/2.0/ Lecture 3.1 BLAST.

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Bioinformatics Tutorial I BLAST and Sequence Alignment.
BLAST Sequence alignment, E-value & Extreme value distribution.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
Lecture 3.11 BLAST. Lecture 3.12 BLAST B asic L ocal A lignment S earch T ool Developed in 1990 and 1997 (S. Altschul) A heuristic method for performing.
Heuristic alignment algorithms and cost matrices
Overview of sequence database searching techniques and multiple alignment May 1, 2001 Quiz on May 3-Dynamic programming- Needleman-Wunsch method Learning.
Sequence Analysis Tools
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
BLAST.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence alignment, E-value & Extreme value distribution
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
BLAST: Basic Local Alignment Search Tool Urmila Kulkarni-Kale Bioinformatics Centre University of Pune.
Access to sequences: GenBank – a place to start and then some more... Links: embl nucleotide archive
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
An Introduction to Bioinformatics
BLAST What it does and what it means Steven Slater Adapted from pt.
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
BLAST : Basic local alignment search tool B L A S T !
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Bacterial Genetics - Assignment and Genomics Exercise: Aims –To provide an overview of the development and.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
What is BLAST? BLAST® (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases.
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
1 P6a Extra Discussion Slides Part 1. 2 Section A.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Database search. Overview : 1. FastA : is suitable for protein sequence searching 2. BLAST : is suitable for DNA, RNA, protein sequence searching.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.
Sequence Alignment.
Construction of Substitution matrices
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
Sequence Search Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Copyright OpenHelix. No use or reproduction without express written consent1.
What is BLAST? Basic BLAST search What is BLAST?
Summer Bioinformatics Workshop 2008 BLAST Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State University – Rochester Center
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
Lab 3.2: Database Similarity Searching “The BLAST Buffet” Stephanie Minnema University of Calgary.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
What is BLAST? Basic BLAST search What is BLAST?
Sequence similarity, BLAST alignments & multiple sequence alignments
Bioinformatics for Research
Blast Basic Local Alignment Search Tool
Basics of BLAST Basic BLAST Search - What is BLAST?
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Identifying templates for protein modeling:
BLAST.
BLAST.
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
Bioinformatics Lecture 2 By: Dr. Mehdi Mansouri
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

http://creativecommons.org/licenses/by-sa/2.0/ Lecture 3.1 BLAST

Sequence Similarity Searching: Understanding and Using Web Based BLAST First & Last Name February X, 2005 Sequence Similarity Searching: Understanding and Using Web Based BLAST Dr. Joanne Fox joanne@bioinformatics.ubc.ca Lecture 3.1 BLAST (c) 2005 CGDN

Concepts of Sequence Similarity Searching The premise: The sequence itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypothesis concerning relatives and function. Lecture 3.1 BLAST

Important Terms for Sequence Similarity Searching with very different meanings The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST similarity refers to a positive matrix score. Identity The extent to which two (nucleotide or amino acid) sequences are invariant. Homology Similarity attributed to descent from a common ancestor. It is your responsibility as an informed bioinformatician to use these terms correctly: A sequence is either homologous or not. Don’t use % with this term! Lecture 3.1 BLAST

Sequence Similarity Searching: The Approach Sequence similarity searching involves the use of a set of algorithms (such as the BLAST programs) to compare a query sequence to all the sequences in a specified database. Comparisons are made in a pairwise fashion. Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared. The higher the score, the greater the degree of similarity. The similarity is measured and shown by aligning two sequences. Lecture 3.1 BLAST

Sequence Similarity Searching – The Alignment Alignments can be global or local (this is algorithm specific) A global alignment is an optimal alignment that includes all characters from each sequence (clustal generates global alignments) A local alignment is an optimal alignment that includes only the most similar local region or regions (BLAST generates local alignments). Lecture 3.1 BLAST

QUERY sequence(s) BLAST results BLAST program BLAST database Lecture 3.1 BLAST

BLAST program Topics: The different blast programs Understanding the BLAST algorithm Word size HSPs Understanding BLAST statistics The alignment score (S) Scoring Matrices Dealing with gaps in an alignment The expectation value (E) Lecture 3.1 BLAST

The BLAST algorithm The BLAST programs (Basic Local Alignment Search Tools) are a set of sequence comparison algorithms introduced in 1990 that are used to search sequence databases for optimal local alignments to a query. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410. Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” NAR 25:3389-3402. Lecture 3.1 BLAST

http://www.ncbi.nlm.nih.gov/BLAST/ blastp blastn blastx tblastn tblastx Lecture 3.1 BLAST

Several different BLAST programs:   Several different BLAST programs: Program  Description blastp Compares an amino acid query sequence against a protein sequence database. blastn Compares a nucleotide query sequence against a nucleotide sequence database. blastx Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence. tblastn Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. tblastx Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive. Lecture 3.1 BLAST

Other BLAST programs BLAST 2 Sequences (bl2seq) VecScreen Aligns two sequences of your choice Can do different types of comparison ex. Blastx Gives dot-plot like output VecScreen Compares query with sequences of known cloning vectors Both very handy for sequencing! Lecture 3.1 BLAST

More BLAST programs BLAST against genomes Many available BLAST parameters pre-optimized Handy for mapping query to genome Search for short exact matches Great for checking probes and primers Lecture 3.1 BLAST

MegaBLAST megaBLAST More detailed info: see megaBLAST pages For aligning sequences which differ slightly due to sequencing errors etc. Very efficient for long query sequences Uses big word (k-tuple) sizes to start search Very fast Accepts batch submissions of ESTs Can upload files of sequences as queries More detailed info: see megaBLAST pages Lecture 3.1 BLAST

How Does BLAST Really Work? The BLAST programs improved the overall speed of searches while retaining good sensitivity (important as databases continue to grow) by breaking the query and database sequences into fragments ("words"), and initially seeking matches between fragments. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". Lecture 3.1 BLAST

Picture used with permission from Chapter 11 of “Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins” Lecture 3.1 BLAST

How Does BLAST Really Work? The BLAST programs improved the overall speed of searches while retaining good sensitivity (important as databases continue to grow) by breaking the query and database sequences into fragments ("words"), and initially seeking matches between fragments. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". Lecture 3.1 BLAST

Picture used with permission from Chapter 11 of “Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins” Lecture 3.1 BLAST

Each BLAST “hit” generates an alignment that can contain one or more these high scoring pairs (HSPs) Lecture 3.1 BLAST

Where does the score (S) come from? The quality of each pair-wise alignment is represented as a score and the scores are ranked. Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein). The alignment score will be the sum of the scores for each position. Lecture 3.1 BLAST

What’s a scoring matrix? Substitution matrices are used for amino acid alignments. These are matrices in which each possible residue substitution is given a score reflecting the probability that it is related to the corresponding residue in the query. A unitary matrix is used for DNA pairs because each position can be given a score of +1 if it matches and a score of zero if it does not. Lecture 3.1 BLAST

PAM vs. BLOSUM scoring matrices BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix. Lecture 3.1 BLAST

PAM vs BLOSUM scoring matrices The PAM Family PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. Other PAM matrices are extrapolated from PAM1. The BLOSUM family BLOSUM matrices are based on local alignments. BLOSUM 62 is a matrix calculated from comparisons of sequences with no less than 62% divergence. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. Lecture 3.1 BLAST

What happens if you have a gap in the alignment? A gap is a position in the alignment at which a letter is paired with a null Gap scores are negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is frequently ascribed more significance than the length of the gap. Hence the gap is penalized heavily, whereas a lesser penalty is assigned to each subsequent residue in the gap. Lecture 3.1 BLAST

What do the Score and the e-value really mean? The quality of the alignment is represented by the Score. Score (S) The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM) whereas gap scores are assigned empirically . The significance of each alignment is computed as an E value. E value (E) Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. Lecture 3.1 BLAST

Is the E-value the same P-value? E value (E) Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. When E < 0.01, P-values and E-value are nearly identical. So, the E-value is the number of times you expect to see your hit occur in the database (with as good as or better score) due to randomn chance alone. Lecture 3.1 BLAST

QUERY sequence(s) BLAST results BLAST program BLAST database Lecture 3.1 BLAST

BLAST databases Topics: The different blast databases provided by the NCBI Protein databases Nucleotide databases Genomic databases Considerations for choosing a BLAST database Custom databases for BLAST Lecture 3.1 BLAST

BLAST protein databases available at through blastp web interface @ NCBI blastp db Lecture 3.1 BLAST

BLAST nucleotide databases available at through blastn web interface @ NCBI blastn db Lecture 3.1 BLAST

Considerations for choosing a BLAST database First consider your research question: Are you looking for an ortholog in a particular species? BLAST against the genome of that species. Are you looking for additional members of a protein family across all species? BLAST against nr, if you can’t find hits check wgs, htgs, and the trace archives. Are you looking to annotate genes in your species of interest? BLAST against known genes (RefSeq) and/or ESTs from a closely related species. Lecture 3.1 BLAST

When choosing a database for BLAST… It is important to know your reagents. Changing your choice of database is changing your search space completely Database size affects the BLAST statistics record BLAST parameters, database choice, database size in your bioinformatics lab book, just as you would for your wet-bench experiments. Databases change rapidly and are updated frequently It may be necessary to repeat your analyses Lecture 3.1 BLAST

Creating Custom Databases for BLAST UBiC FAQ Lecture 3.1 BLAST

QUERY sequence(s) BLAST results BLAST program BLAST database Lecture 3.1 BLAST

BLAST results Topics: Choosing the right BLAST program First & Last Name February X, 2005 Topics: BLAST results Choosing the right BLAST program Running a blastp search BLAST parameters and options to consider Viewing BLAST results Look at your alignments Using the BLAST taxonomy report Lecture portion ends – should be 1hr to here (34 slides) that leaves 20 min for the remaining ~25 demo slides Lecture 3.1 BLAST (c) 2005 CGDN

http://www.ncbi.nlm.nih.gov/BLAST/ blastp blastn blastx tblastn tblastx Lecture 3.1 BLAST

http://www.ncbi.nlm.nih.gov/BLAST/ Program selection guide Lecture 3.1 BLAST

“What BLAST program should I use “What BLAST program should I use?” – check the NCBI’s BLAST Program selection guide Lecture 3.1 BLAST

http://www.ncbi.nlm.nih.gov/BLAST/ blastp Lecture 3.1 BLAST

Input your query (gi|231571) as FASTA, raw sequence, or Accession/ID and choose your database Lecture 3.1 BLAST

Links to more information can be found on the BLAST page Lecture 3.1 BLAST

BLAST parameters and options to consider: conserved domains Entrez query E-value cutoff Word size Lecture 3.1 BLAST

More BLAST parameters and options to consider: filtering gap penalities matrix Lecture 3.1 BLAST

Run your BLAST search: BLAST Lecture 3.1 BLAST

The BLAST Queue: click for more info Note your RID Lecture 3.1 BLAST

Formatting and Retrieving your BLAST results: options Lecture 3.1 BLAST

A graphical view of your BLAST results: Lecture 3.1 BLAST

The BLAST “hit” list: Score E-Value GenBank alignment EntrezGene Lecture 3.1 BLAST

The BLAST pairwise alignments Identity Similarity Lecture 3.1 BLAST

Sorting BLAST results by Taxonomy Report Lecture 3.1 BLAST

Tax BLAST Report Summary hits by lineage BLAST hits by organism Lecture 3.1 BLAST

BLAST statistics to record in your bioinformatics labbook Record the statistics that are found at bottom of your BLAST results page Lecture 3.1 BLAST

Homology: Some Guidelines First & Last Name February X, 2005 Homology: Some Guidelines Similarity can be indicative of homology Generally, if two sequences are significantly similar over entire length they are likely homologous Low complexity regions can be highly similar without being homologous Homologous sequences not always highly similar Suggested BLAST Cutoffs (source: Chapter 11 – Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins) For nucleotide based searches, one should look for hits with E-values of 10-6 or less and sequence identity of 70% or more For protein based searches, one should look for hits with E-values of 10-3 or less and sequence identity of 25% or more Need to bring class back together with 10 minutes to go in classroom time. Lecture 3.1 BLAST (c) 2005 CGDN

Advanced BLAST programs The NCBI BLAST pages have several advanced BLAST methods available PSI-BLAST PHI-BLAST RPS-BLAST All are powerful methods based on protein similarities Lecture 3.1 BLAST

PSI-BLAST Position Specific Iterated – BLAST A cycling/iterative method Gives increased sensitivity for detecting distantly related proteins Can give insight into functional relationships Very refined statistical methods Fast – still based on BLAST methods Simple to use Lecture 3.1 BLAST

How does PSI-BLAST work? First, a standard blastp is performed The highest scoring hits are used to generate a multiple alignment A Position Specific Scoring Matrix (PSSM) is generated from the multiple alignment. Highly conserved residues get high scores Less conserved residues get lower scores The PSSM describes the sequence similarity between your query and all significant blastp hits Another similarity search is performed, this time using the new PSSM instead of the standard BLOSUM or PAM matrices - This PSSM (scoring matrix) is now customized to find sequences that are related to your original query Steps 2-4 can be repeated until convergence Convergence occurs when no new sequences appear after iteration Lecture 3.1 BLAST

http://www.ncbi.nlm.nih.gov/BLAST/ PSI-BLAST Lecture 3.1 BLAST

Format results for PSI-BLAST with inclusion E-value set at 0.005 Lecture 3.1 BLAST

Contributors Special thanks to David Wishart, Andy Baxevanis, Stephanie Minnema, Sohrab Shah, and Francis Ouellette for their contributions to these materials You are now ready to complete the BLAST assignment Lecture 3.1 BLAST