Lecture 3.1: BLAST

BLAST
- Basic Local Alignment Search Tool
- Developed in 1990 and 1997 (S. Altschul)
- A heuristic method for performing local alignments through searches of high-scoring segment pairs (HSPs)
- First to use statistics to predict the significance of initial matches, which saves on false leads
- Offers both sensitivity and speed

BLAST
- Looks for clusters of nearby or locally dense "similar or homologous" k-tuples
- Uses look-up tables to shorten search time
- Uses a larger word size than FASTA to accelerate the search process
- Performs local (not global) alignment
- Fastest and most frequently used sequence alignment tool: THE STANDARD
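The look-up table idea can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it indexes only exact words of the query, whereas BLAST also indexes similar "neighbourhood" words that score above the threshold T. The function names and the two short sequences are made up for the example.

```python
from collections import defaultdict

def build_word_index(sequence, word_size=3):
    """Map every overlapping word of length word_size to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        index[sequence[start:start + word_size]].append(start)
    return index

def find_seed_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where the same word occurs in both."""
    index = build_word_index(query, word_size)
    hits = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        # One dictionary look-up replaces a scan of the whole query.
        for q_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits

# Hypothetical toy sequences sharing the segment KIQIYG.
hits = find_seed_hits("MATWLKIQIYG", "GGKIQIYGTGC")
print(hits)
```

Hits that fall on the same diagonal (constant query_pos - subject_pos) are the "clusters of nearby k-tuples" that would then be extended into an HSP.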

BLAST Access
- NCBI BLAST
- Canadian Bioinformatics Resource BLAST
- European Bioinformatics Institute BLAST

Different Flavours of BLAST
- BLASTP: protein query against protein DB
- BLASTN: DNA/RNA query against GenBank (DNA)
- BLASTX: 6-frame translated DNA query against protein DB
- TBLASTN: protein query against 6-frame translated GenBank
- TBLASTX: 6-frame translated DNA query against 6-frame translated GenBank
- PSI-BLAST: protein "profile" query against protein DB
- PHI-BLAST: protein pattern against protein DB
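BLASTX, TBLASTN and TBLASTX all rely on six-frame translation: three reading frames on the forward strand and three on the reverse complement. The sketch below shows the idea using the standard genetic code; the function names and frame labels (+1 to -3) are choices made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translation(dna):
    """Return a dict of the six conceptual translations of a DNA/RNA sequence."""
    dna = dna.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six protein strings would then be searched as a separate query (BLASTX) or database entry (TBLASTN).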

Other BLAST Services
- MEGABLAST: for comparison of large sets of long DNA sequences
- RPS-BLAST: conserved domain detection
- BLAST 2 Sequences: for performing pairwise alignments of 2 chosen sequences
- Genomic BLAST: for alignments against select human, microbial or malarial genomes
- VecScreen: for detecting cloning vector contamination in sequenced data

Running NCBI BLAST

MT0895
MMKIQIYGTGCANCQMLEKNAREAVKELG
IDAEFEKIKEMDQILEAGLTALPGLAVDG
ELKIMGRVASKEEIKKILS

Running NCBI BLAST
- Paste in the sequence (FASTA format or raw sequence), or type in a GI or accession number
>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
>
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR a GI or accession number
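The difference between the accepted input styles is only the optional ">" header line. A minimal sketch of a parser that handles both FASTA and raw input follows; `parse_query` is a hypothetical helper for illustration, not part of any NCBI tool.

```python
def parse_query(text):
    """Return (identifier, sequence) from FASTA-formatted or raw sequence text."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    identifier = ""
    if lines and lines[0].startswith(">"):
        # Everything after '>' on the first line is the description.
        identifier = lines[0][1:].strip()
        lines = lines[1:]
    # Join the remaining lines and drop any internal whitespace.
    sequence = "".join("".join(line.split()) for line in lines).upper()
    return identifier, sequence

fasta = """>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
"""
raw = "KIQIYGTGCANCQMLEKNAREAVKELGIDAE FEKIKEMDQILEAGLTALPGLAVDGELKIDS"

print(parse_query(fasta))
print(parse_query(raw))
```

Both calls yield the same sequence; only the identifier differs.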

Running NCBI BLAST
- Choose a range of interest in the sequence with "set subsequences" (not usually used)
- Select the database from the pull-down menu (usually choose nr = non-redundant)
- Keep the CD Search check box on
- Leave "Options" unchanged (use defaults)
- Go to the "Format" menu and adjust the number of descriptions and alignments as desired

Running NCBI BLAST
- Select Database

Conserved Domain Database
- Contains a collection of pre-identified functional or structural domains
- Derived from the Pfam and SMART databases as well as other sources
- Uses Reverse Position Specific BLAST (RPS-BLAST) to perform the search
- The query sequence is compared to a PSSM derived from each of the aligned domains

Running NCBI BLAST
- Click BLAST!

Formatting Results

BLAST Format Options

BLAST Output

BLAST Parameters
- Identities: number and % of exact residue matches
- Positives: number and % of similar plus identical matches
- Gaps: number and % of gaps introduced
- Score: summed HSP score (S)
- Bit Score: a normalized score (S')
- Expect (E): expected number of chance HSP alignments
- P: probability of getting a score > X
- T: minimum word or k-tuple score (threshold)
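The relationship between S, S', E and P can be made concrete with the Karlin-Altschul formulas: S' = (lambda*S - ln K) / ln 2, E = m*n*2^(-S'), and P = 1 - exp(-E), where m and n are the query and database lengths. The sketch below assumes these standard formulas; the lambda and K values used in the example (0.267 and 0.041) are illustrative assumptions, since the real values depend on the scoring matrix and gap penalties.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw HSP score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits, query_len, db_len):
    """Expected number of chance HSPs scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)

def p_value(expect):
    """Probability of seeing at least one chance HSP, given E."""
    return -math.expm1(-expect)

# Assumed example: raw score 100, 62-residue query, 10^9-residue database.
bits = bit_score(100, lam=0.267, k=0.041)
e = expect_value(bits, query_len=62, db_len=10**9)
print(f"bit score = {bits:.1f}, E = {e:.2e}, P = {p_value(e):.2e}")
```

Two properties are worth noting: each extra bit halves E, and for small E the values of E and P are almost identical.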

BLAST - Rules of Thumb
- Expect (E-value) is the number of BLAST alignments with a given Score that are expected to be seen simply due to chance
- Don't trust a BLAST alignment with an Expect value > 0.01 (Grey zone is between )
- Expect and Score are related, but Expect contains more information
- Note that %Identities is more useful than the bit Score
- Recall Doolittle's Curve (%ID vs. Length, next slide): %ID > 30 - numres/50
- If uncertain about a hit, perform a PSI-BLAST search
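The two numeric rules of thumb above (Expect no greater than 0.01, and %ID > 30 - numres/50) can be written as a small check. This is a sketch of the slide's heuristics only, with hypothetical function names; it is not a statistical test.

```python
def identity_threshold(num_residues):
    """Minimum %ID suggested by the rule %ID > 30 - numres/50."""
    return 30 - num_residues / 50

def looks_trustworthy(expect, percent_identity, num_residues, max_expect=0.01):
    """Apply both rules of thumb: a low E-value and enough identity for the length."""
    return (
        expect <= max_expect
        and percent_identity > identity_threshold(num_residues)
    )

# A 100-residue alignment needs more than 28% identity under this rule.
print(identity_threshold(100))
print(looks_trustworthy(expect=1e-5, percent_identity=35, num_residues=100))
print(looks_trustworthy(expect=0.5, percent_identity=35, num_residues=100))
```

Hits that fail either check are the ones the slide suggests following up with PSI-BLAST.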

Doolittle's Curve (figure; the Twilight Zone is marked)

Getting the Most from BLAST

BLAST Options

BLAST Options (defaults in parentheses)
- Composition-based statistics (Yes)
- Sequence Complexity Filter (Yes)
- Expect (E) value (10)
- Word Size (3)
- Substitution or Scoring Matrix (BLOSUM62)
- Gap Insertion Penalty (11)
- Gap Extension Penalty (1)
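The two gap parameters combine into an affine gap penalty. The sketch below assumes the common BLAST convention that a gap of length k costs insertion + k * extension, so a single-residue gap costs 12 with these defaults. The dictionary keys are hypothetical names chosen for this example, not actual BLAST option names.

```python
# Default protein search settings listed on the slide (key names are illustrative).
BLASTP_DEFAULTS = {
    "composition_based_statistics": True,
    "low_complexity_filter": True,
    "expect": 10,
    "word_size": 3,
    "matrix": "BLOSUM62",
    "gap_insertion": 11,
    "gap_extension": 1,
}

def gap_penalty(length, options=BLASTP_DEFAULTS):
    """Affine penalty for one gap: a fixed opening cost plus a per-residue cost."""
    if length <= 0:
        return 0
    return options["gap_insertion"] + options["gap_extension"] * length

for k in (1, 3, 10):
    print(f"gap of length {k}: penalty {gap_penalty(k)}")
```

Because the opening cost dominates, one long gap is penalized far less than several short ones of the same total length.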

Composition Statistics
- Recent addition to the BLAST algorithm
- Permits calculated E (Expect) values to account for the amino acid composition of queries and database hits
- Improves accuracy and reduces false positives
- Effectively conducts a different scoring procedure for each sequence in the database

LCRs (low complexity regions)
- Watch out for:
  - transmembrane or signal peptide regions
  - coiled-coil regions
  - short amino acid repeats (collagen, elastin)
  - homopolymeric repeats
- BLAST uses SEG to mask amino acids
- BLAST uses DUST to mask bases
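To show what masking does, here is a deliberately simplified stand-in for a complexity filter: it slides a window along the sequence and replaces residues with X wherever the Shannon entropy of the window is low. This is not the SEG or DUST algorithm; the window size, the entropy cutoff and the test sequence are illustrative assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

# Hypothetical sequence with a homopolymeric proline run in the middle.
seq = "MKIQIYGTGCAN" + "P" * 12 + "QMLEKNAREAVK"
print(mask_low_complexity(seq))
```

Masked residues no longer seed word hits, which is how filtering avoids large numbers of spurious matches to repeats.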

Scoring Matrices
- BLOSUM Matrices
  - Developed by Henikoff & Henikoff (1992)
  - BLOcks SUbstitution Matrix
  - Derived from the BLOCKS database
- PAM Matrices
  - Developed by Schwartz and Dayhoff (1978)
  - Point Accepted Mutation
  - Derived from manual alignments of closely related proteins

How to Make Your Own Matrix
- Perform alignment:
  ACDEFGH..
  ACDEFGK..
  AADEFGH..
  GCDEFGH..
  ACAEYGK..
  ACAEFAH..
- Calculate frequencies:
  $f(A,A) = \frac{\#A_{obs}}{\#A_{exp}}$
  $f(C,A) = \frac{\#C/A_{obs}}{\#A_{exp} + \#C_{exp}}$
- Fill substitution matrix (rows and columns A, C, D, E, ...)
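The align, count, fill procedure can be sketched as follows, using the six aligned sequences from the slide. Note the assumption: this sketch uses the usual BLOSUM-style log-odds score, 2 * log2(observed / expected) with expected pair frequencies derived from the residue frequencies, which may differ in detail from the ratio written on the slide.

```python
import math
from collections import Counter
from itertools import combinations

def log_odds_matrix(block):
    """Build rounded log-odds scores (half-bit units) from an ungapped alignment."""
    # Step 1: count every residue pair observed within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Step 2: residue frequencies implied by the pair counts.
    residue_freq = Counter()
    for (a, b), n in pair_counts.items():
        residue_freq[a] += n / (2 * total_pairs)
        residue_freq[b] += n / (2 * total_pairs)

    # Step 3: fill the matrix with observed-vs-expected log-odds scores.
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = round(2 * math.log2(observed / expected))
    return scores

block = ["ACDEFGH", "ACDEFGK", "AADEFGH", "GCDEFGH", "ACAEYGK", "ACAEFAH"]
for pair, score in sorted(log_odds_matrix(block).items()):
    print(pair, score)
```

Pairs never observed in the block get no entry here; a real matrix needs far more data (or pseudocounts) to assign them negative scores.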

PAM versus BLOSUM
- PAM
  - First useful scoring matrix for proteins
  - Assumed a Markov model of evolution (i.e. all sites equally mutable and independent)
  - Derived from small, closely related proteins with ~15% divergence
- BLOSUM
  - Much later entry to the matrix "sweepstakes"
  - No evolutionary model is assumed
  - Built from PROSITE-derived sequence blocks
  - Uses a much larger, more diverse set of protein sequences (30% - 90% ID)

PAM versus BLOSUM
- PAM
  - Higher PAM numbers to detect more remote sequence similarities
  - Lower PAM numbers to detect high similarities
  - 1 PAM ~ 1 million years of divergence
  - Errors in PAM 1 are scaled 250X in PAM 250
- BLOSUM
  - Lower BLOSUM numbers to detect more remote sequence similarities
  - Higher BLOSUM numbers to detect high similarities
  - Sensitive to structural and functional substitution
  - Errors in BLOSUM arise from errors in alignment

PAM Matrices
- PAM 40: prepared by multiplying PAM 1 by itself a total of 40 times; best for short alignments with high similarity
- PAM 120: prepared by multiplying PAM 1 by itself a total of 120 times; best for general alignment
- PAM 250: prepared by multiplying PAM 1 by itself a total of 250 times; best for detecting distant sequence similarity
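The repeated multiplication is ordinary matrix exponentiation of a mutation probability matrix. The sketch below uses a made-up 3-letter alphabet so the numbers stay readable; `TOY_PAM1` is an invented matrix for illustration, not Dayhoff's 20x20 PAM 1.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]

def matrix_power(matrix, n):
    """Multiply the matrix into the identity n times, giving matrix**n."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, matrix)
    return result

# Toy mutation probabilities: each row sums to 1, with about 1% change per step.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

for n in (40, 120, 250):
    diagonal = [round(row[i], 3) for i, row in enumerate(matrix_power(TOY_PAM1, n))]
    print(f"toy PAM {n}: probability a residue is unchanged = {diagonal}")
```

The shrinking diagonal is why higher PAM numbers suit more distant relationships, and it also shows how any error in PAM 1 is compounded at PAM 250.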

BLOSUM Matrices
- BLOSUM 90: prepared from BLOCKS alignments after clustering sequences that share >90% sequence ID; best for short alignments with high similarity
- BLOSUM 62: prepared from BLOCKS alignments after clustering sequences that share >62% sequence ID; best for general alignment (default)
- BLOSUM 30: prepared from BLOCKS alignments after clustering sequences that share >30% sequence ID; best for detecting weak local alignments

Scraping the Bottom of the Barrel with PSI-BLAST

PSI-BLAST Algorithm
1. Perform an initial alignment with BLAST using the BLOSUM 62 substitution matrix
2. Construct a multiple alignment from the matches
3. Prepare a position-specific scoring matrix (PSSM)
4. Use the PSSM profile as the scoring matrix for a second BLAST run against the database
5. Repeat steps 2-4 until convergence (no new sequences are found)
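The step that distinguishes PSI-BLAST from BLAST is the position-specific scoring matrix. A minimal sketch of building one from a multiple alignment and scanning a sequence with it is shown below. It assumes a uniform background and simple add-one pseudocounts, which is cruder than the weighting PSI-BLAST actually uses, and the short alignment is invented for the example.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """One dict of log-odds scores per alignment column (uniform background)."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        residues = [r for r in column if r in ALPHABET]  # ignore gap characters
        total = len(residues) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((residues.count(aa) + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm

def score_window(pssm, segment):
    """Sum the column-specific scores for an ungapped segment."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, segment))

def best_match(pssm, sequence):
    """Return (score, start) of the best-scoring ungapped window in the sequence."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical hits from a first BLAST round, already aligned.
pssm = build_pssm(["KIQIY", "KIQVY", "RIQIY"])
print(best_match(pssm, "GGKIQIYGG"))
```

Unlike BLOSUM 62, the same residue now scores differently at each position, which is what lets later iterations pick up more distant family members.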

PSI-BLAST
- Press Iterate!

PSI-BLAST
- For protein sequences ONLY
- Much more sensitive than BLAST
- Slower (iterative process)
- Often yields results that are as good as many common threading methods
- SHOULD BE YOUR FIRST CHOICE IN ANALYZING A NEW SEQUENCE

BLAST against PDB

Still Confused?

Conclusions
- BLAST is the most important program in bioinformatics (maybe all of biology)
- BLAST is based on sound statistical principles (key to its speed and sensitivity)
- A basic understanding of its principles is key for using and interpreting BLAST output
- Use BLASTN or MEGABLAST for DNA
- Use PSI-BLAST for protein searches