BLAST: Basic Local Alignment Search Tool


Pairwise alignment: key points. Pairwise alignments allow us to describe the percent identity two sequences share, as well as the percent similarity. The score of a pairwise alignment includes positive values for exact matches, and other scores for mismatches and gaps. PAM and BLOSUM matrices provide a set of rules for assigning scores. PAM10 and BLOSUM80 are examples of matrices appropriate for the comparison of closely related sequences. PAM250 and BLOSUM30 are examples of matrices used to score distantly related proteins. Global and local alignments can be made.
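As a rough illustration of the first two key points, the sketch below computes percent identity and a simple raw score for two sequences that are already aligned. The function names and the match/mismatch/gap values are assumptions chosen for the example; they are not the PAM or BLOSUM scores that the slides refer to.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both rows are gaps; they carry no information.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def alignment_score(aligned_a: str, aligned_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Raw score under a toy scheme (illustrative values, not a real matrix)."""
    score = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap        # any column containing a gap is penalized
        elif a == b:
            score += match      # exact match
        else:
            score += mismatch   # substitution
    return score


if __name__ == "__main__":
    seq_a = "MRCKTETGAR"
    seq_b = "MRCGTETGAR"
    print(percent_identity(seq_a, seq_b))   # 9 of 10 columns match
    print(alignment_score(seq_a, seq_b))
```

A real scoring matrix would replace the single mismatch value with a residue-pair lookup, which is what lets similar (but not identical) residues score positively.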

BLAST BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database. The BLAST algorithm is fast, accurate, and web-accessible. page 87

Why use BLAST? BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include: identifying orthologs and paralogs; discovering new genes or proteins; discovering variants of genes or proteins; investigating expressed sequence tags (ESTs); exploring protein structure and function. page 88

Four components to a BLAST search (1) Choose the sequence (query) (2) Select the BLAST program (3) Choose the database to search (4) Choose optional parameters Then click “BLAST” page 88

Step 1: Choose your sequence Sequence can be input in FASTA format or as accession number page 89

Example of the FASTA format for a BLAST query Fig page 32
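For readers who have not seen the format, a FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below is a minimal parser under that assumption; the record shown is a made-up example, not the figure from the slide.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical record used only for illustration.
    example = ">example_protein hypothetical record\nMKWVWALLLL\nAALGSGRA\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```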

Step 2: Choose the BLAST program

blastn (nucleotide BLAST); blastp (protein BLAST); tblastn (translated BLAST); blastx (translated BLAST); tblastx (translated BLAST). page 90

DNA potentially encodes six proteins page 92
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
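The six reading frames on this slide can be reproduced with a short sketch: three offsets on the strand as given, and three on its reverse complement. The function names and the "+1" to "-3" frame labels are assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse so the result reads 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Split a DNA sequence into codons for all six reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Stop early enough that every codon has exactly three bases.
            codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    top = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
    for label, codons in six_frames(top).items():
        print(label, " ".join(codons[:2]))
```

Translating each codon list with a genetic code table gives the six conceptual proteins that blastx, tblastn, and tblastx compare.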

Choose the BLAST program
Program   Input     Database   Comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
Fig. 4.3 page 91

Step 3: Choose the database. nr = non-redundant (most general database); dbest = database of expressed sequence tags; dbsts = database of sequence tag sites; gss = genomic survey sequences; htgs = high throughput genomic sequence. page 92-93

Step 4a: Select optional search parameters Entrez! algorithm organism

Step 4a: optional blastp search parameters Filter, mask Scoring matrix Word size Expect

Step 4a: optional blastn search parameters Filter, mask Match/mismatch scores Word size Expect

BLAST: optional parameters. You can: choose the organism to search; turn filtering on/off; change the substitution matrix; change the expect (E) value; change the word size; change the output format. page 93

Fig. 4.7 page 95

Fig. 4.8 page 96 More significant hits with filtering turned off… …and longer alignments

NCBI blast offers masking in lowercase or colored formats

(a) Query: human insulin NP_ Program: blastp Database: C. elegans RefSeq Default settings: Unfiltered (“composition-based statistics”) Our starting point: search human insulin against worm RefSeq proteins by blastp using default parameters

(b) Query: human insulin NP_ Program: blastp Database: C. elegans RefSeq Option: No compositional adjustment Note that the bit score, Expect value, and percent identity all change with the “no compositional adjustment” option

(c) Query: human insulin NP_ Program: blastp Database: C. elegans RefSeq Option: conditional compositional score matrix adjustment Note that the bit score, Expect value, and percent identity all change with the compositional score matrix adjustment

(d) Query: human insulin NP_ Program: blastp Database: C. elegans RefSeq Option: Filter low complexity regions Note that the bit score, Expect value, and percent identity all change with the filter option

(e) Query: human insulin NP_ Program: blastp Database: C. elegans RefSeq Option: Mask for lookup table only Note that the bit score, Expect value, and percent identity could change with the “mask for lookup table only” option

Step 4b: optional formatting parameters Alignment view Descriptions Alignments page 97

BLAST format options

BLAST search output: multiple alignment format

BLAST search output: top portion taxonomy database query program

taxonomy page 97

BLAST search output: graphical output

BLAST search output: tabular output. High scores, low E values. Cut-off: 0.05?

Fig page 108 We will get to the bottom of a BLAST search in a few minutes…

BLAST: background on sequence alignment There are two main approaches to sequence alignment: [1] Global alignment (Needleman & Wunsch 1970) using dynamic programming to find optimal alignments between two sequences. (Although the alignments are optimal, the search is not exhaustive.) Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence “global”). page 100

BLAST: background on sequence alignment [2] The second approach is local sequence alignment (Smith & Waterman, 1981). The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences. S-W is guaranteed to find optimal alignments, but it is computationally expensive (requires $O(n^2)$ time). BLAST and FASTA are heuristic approximations to local alignment. Each requires only $O(n^2/k)$ time; they examine only part of the search space. page 100; 71
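To make the cost of exact local alignment concrete, here is a minimal sketch of the Smith-Waterman recurrence that fills the full dynamic programming matrix, which is where the quadratic time comes from. The linear gap penalty and the match/mismatch values are assumptions for illustration; a real protein search would use a substitution matrix.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The floor of 0 lets an alignment restart anywhere (local, not global).
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("AAAGGGTTT", "CCGGGCC"))
```

Every one of the len(a) x len(b) cells is visited, whereas a heuristic such as BLAST only extends alignments around word hits.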

How a BLAST search works “The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990) (page 101, 102)
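The idea of words of length w scoring at least T can be sketched as follows. This is a toy version: it scores residue pairs with a flat match/mismatch scheme (an assumption made to keep the example self-contained) rather than BLOSUM62, so the thresholds here are not comparable to the real blastp default.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def toy_score(x: str, y: str, match: int = 4, mismatch: int = -1) -> int:
    """Stand-in for a substitution matrix lookup (illustrative values only)."""
    return match if x == y else mismatch


def neighborhood(word: str, threshold: int) -> list[str]:
    """All words of the same length whose score against `word` is >= threshold."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


def seed_words(query: str, w: int = 3, threshold: int = 7) -> dict[int, list[str]]:
    """Neighborhood words for every length-w window of the query, keyed by offset."""
    return {i: neighborhood(query[i:i + w], threshold)
            for i in range(len(query) - w + 1)}


if __name__ == "__main__":
    print(len(neighborhood("MKW", 12)))  # only the exact word reaches 12
    print(len(neighborhood("MKW", 7)))   # exact word plus single substitutions
```

Lowering the threshold enlarges each neighborhood, which makes the search more sensitive but slower; raising it does the opposite.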

Pairwise alignment scores are determined using a scoring matrix such as Blosum62 Page 61

How a BLAST search works: threshold You can modify the threshold parameter. The default value for blastp is 11. To change it, enter “-f 16” or “-f 5” in the advanced options of NetBLAST. (To find NetBLAST enter it as a query on the NCBI site map.) page 102

MegaBLAST: 50 kilobases of the globin locus

BLAST-related tools for genomic DNA. Recently developed tools include MegaBLAST at NCBI and BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query. Thus it is a mirror image of the BLAST strategy. SSAHA at Ensembl uses a strategy similar to BLAT's. Page 136
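The "mirror image" strategy (index the database, then scan the query) can be sketched in a few lines. This is a simplified illustration, not BLAT's implementation: the choice of non-overlapping database words and the function names are assumptions for the example.

```python
from collections import defaultdict


def build_index(database: str, k: int = 11) -> dict[str, list[int]]:
    """Map each non-overlapping k-mer of the database to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(database) - k + 1, k):
        index[database[start:start + k]].append(start)
    return index


def find_seeds(query: str, index: dict[str, list[int]], k: int = 11) -> list[tuple[int, int]]:
    """Return (query_offset, database_offset) pairs where a query k-mer is indexed."""
    seeds = []
    for q in range(len(query) - k + 1):          # overlapping query windows
        for d in index.get(query[q:q + k], []):
            seeds.append((q, d))
    return seeds


if __name__ == "__main__":
    db = "AAAACCCCGGGGTTTT"
    idx = build_index(db, k=4)                   # small k only for the demo
    print(find_seeds("TTCCCCGG", idx, k=4))
```

Because the index is built once and reused, each query only costs a pass over its own k-mers plus the extension of whatever seeds are found.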

To access BLAT, visit “BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 20 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein blat on land vertebrates.”--BLAT website

Paste DNA or protein sequence here in the FASTA format

BLAT output includes browser and other formats

Blastz Fig. 5.17

How to interpret a BLAST search: expect value It is important to assess the statistical significance of search results. For global alignments, the statistics are poorly understood. For local alignments (including BLAST search results), the statistics are well understood. The scores follow an extreme value distribution (EVD) rather than a normal distribution. page 103

How to interpret a BLAST search: expect value. The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is: $E = Kmn\,e^{-\lambda S}$ page 105

This equation is derived from a description of the extreme value distribution. S = the score; E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S; m, n = the lengths of the two sequences; λ, K = Karlin-Altschul statistical parameters. $E = Kmn\,e^{-\lambda S}$ page 105
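A small numerical sketch may help here, assuming the standard Karlin-Altschul form $E = Kmn\,e^{-\lambda S}$. The default values of λ and K below are illustrative assumptions; in practice they depend on the scoring matrix and gap penalties and are reported at the bottom of the BLAST output.

```python
import math


def expect_value(raw_score: float, m: float, n: float,
                 lam: float = 0.267, K: float = 0.041) -> float:
    """E = K * m * n * exp(-lambda * S) for query length m and database length n."""
    return K * m * n * math.exp(-lam * raw_score)


if __name__ == "__main__":
    query_len = 250          # hypothetical query length
    db_len = 1_000_000       # hypothetical database length
    for score in (30, 50, 70, 90):
        # Each extra 20 points of raw score shrinks E by the same constant factor.
        print(score, expect_value(score, query_len, db_len))
```

Doubling either m or n doubles E, while adding a fixed amount to S divides E by a fixed factor; that asymmetry is why modest score increases matter so much.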

Some properties of the equation $E = Kmn\,e^{-\lambda S}$. The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values. The expected score for aligning a random pair of amino acids must be negative; otherwise, long random alignments would acquire great scores. Parameter K is a scaling factor for the search space (database), and λ is a scaling factor for the scoring system. For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly. page

From raw scores to bit scores. There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores). Bit scores are comparable between different searches because they are normalized to account for the scoring system, through the parameters λ and K: $S' = (\lambda S - \ln K)/\ln 2$. The E value corresponding to a given bit score is: $E = mn\,2^{-S'}$, so the database size enters only through the search space mn. Bit scores allow you to compare results between different database searches, even using different scoring matrices. page 106
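The sketch below converts a raw score to a bit score with $S' = (\lambda S - \ln K)/\ln 2$ and checks that $E = mn\,2^{-S'}$ gives the same E value as $Kmn\,e^{-\lambda S}$. The λ and K values, sequence lengths, and function names are assumptions chosen for the example.

```python
import math


def bit_score(raw_score: float, lam: float, K: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_bits(bits: float, m: float, n: float) -> float:
    """E = m * n * 2^(-S'); lambda and K are already folded into the bit score."""
    return m * n * 2.0 ** (-bits)


def e_from_raw(raw_score: float, m: float, n: float, lam: float, K: float) -> float:
    """E = K * m * n * exp(-lambda * S), for comparison."""
    return K * m * n * math.exp(-lam * raw_score)


if __name__ == "__main__":
    lam, K = 0.267, 0.041        # illustrative parameter values
    m, n = 250, 1_000_000        # hypothetical query and database lengths
    bits = bit_score(100, lam, K)
    print(bits, e_from_bits(bits, m, n), e_from_raw(100, m, n, lam, K))
```

The two E values agree because $2^{-S'} = K\,e^{-\lambda S}$, which is exactly what the normalization is designed to achieve.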

How to interpret BLAST: E values and p values. The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment. $p = 1 - e^{-E}$ page 106

How to interpret BLAST: E values and p values. Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values. [Table 4.4: E values and corresponding p values, with rows annotated at p of about 0.1, about 0.05, and about 0.001] page 107
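The relationship between E and p can be tabulated directly, assuming $p = 1 - e^{-E}$. The E values in the loop are chosen for illustration and are not copied from Table 4.4.

```python
import math


def p_from_e(e_value: float) -> float:
    """p = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    print(f"{'E':>10}  {'p':>12}")
    for e_value in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
        print(f"{e_value:>10}  {p_from_e(e_value):>12.8f}")
```

For small E the two numbers nearly coincide, while for E of 1 or more p saturates toward 1, which is why E is the more readable quantity in that range.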

How to interpret BLAST: getting to the bottom page 107

[Figure: bottom of a BLAST search output, annotated with: threshold score = 11; EVD parameters; BLOSUM matrix; effective search space = mn = length of query x database length; 10.0 is the E value; gap penalties; cut-off parameters] Fig page 108

BLAST search strategies. General concepts; how to evaluate the significance of your results; how to handle too many results; how to handle too few results; BLAST searching with HIV-1 pol, a multidomain protein; BLAST searching with lipocalins using different matrices. page

Sometimes a real match has an E value > 1 Fig page 110 … try a reciprocal BLAST to confirm

Sometimes a similar E value occurs for a short exact match and a long, less exact match. Fig page 111 [Figure labels: short, nearly exact; long, only 31% identity, similar E value]

Assessing whether proteins are homologous RBP4 and PAEP: Low bit score, E value 0.49, 24% identity (“twilight zone”). But they are indeed homologous. Try a BLAST search with PAEP as a query, and find many other lipocalins. Fig page 111

Fig page 114 Searching with a multidomain protein, pol

Fig page 116

Searching bacterial sequences with pol Fig page 117

Position specific iterated BLAST: PSI-BLAST The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by employing a scoring matrix that is customized to your query. Page 137

PSI-BLAST is performed in five steps [1] Select a query and search it against a protein database Page 138

PSI-BLAST is performed in five steps [1] Select a query and search it against a protein database [2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM) Page 138

[Figure: multiple sequence alignment with variable columns labeled R,I,K; C; D,E,T; K,R,T; N,L,Y,G] Fig. 5.9 Page 139

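[Figure: position-specific scoring matrix (PSSM); columns are the 20 amino acids (A R N D C Q E G H I L K M F P S T W Y V), rows are query positions beginning 1 M, K, W, V, W, A, L, L, L, L] fig. 5.4

[Figure: position-specific scoring matrix (PSSM), continued; same column and row layout] Fig Page 140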

PSI-BLAST is performed in five steps [1] Select a query and search it against a protein database [2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM) [3] The PSSM is used as a query against the database [4] PSI-BLAST estimates statistical significance (E values) Page 138

Fig Page 141

PSI-BLAST is performed in five steps [1] Select a query and search it against a protein database [2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM) [3] The PSSM is used as a query against the database [4] PSI-BLAST estimates statistical significance (E values) [5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query. Page 138
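Step [2] can be illustrated with a deliberately simplified PSSM: count residues per alignment column, add a pseudocount, and take the log-odds against a uniform background. This is an assumption-laden sketch; PSI-BLAST itself uses sequence weighting, real background frequencies, and a more elaborate pseudocount scheme.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """One dict of log2-odds scores per alignment column (gaps are ignored)."""
    background = 1.0 / len(AMINO_ACIDS)          # uniform background, an assumption
    pssm = []
    for column in zip(*alignment):
        residues = [r for r in column if r in AMINO_ACIDS]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm: list[dict[str, float]], segment: str) -> float:
    """Score a segment by summing the column score of each of its residues."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    toy_alignment = ["MKW", "MRW", "MKW"]        # hypothetical aligned hits
    profile = build_pssm(toy_alignment)
    print(score_with_pssm(profile, "MKW"), score_with_pssm(profile, "AAA"))
```

In an iterative search, the profile would be rebuilt from the hits above threshold after each round and then used as the query for the next one.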

Results of a PSI-BLAST search [Table 5-4; columns: Iteration, # hits, # hits > threshold] Page 141

Handout Fig. 5.6 RBP4 match to ApoD, PSI-BLAST iteration 1

Handout Fig. 5.6 RBP4 match to ApoD, PSI-BLAST iteration 3