A Practical Guide to NCBI BLAST

Presentation transcript:

A Practical Guide to NCBI BLAST Leonardo Mariño-Ramírez NCBI, NIH – Bethesda, USA 03/14/2017

NCBI Search Services and Tools. Search services: Entrez (integrated literature and molecular databases); BLAST (sequence similarity search service); VAST (structure similarity searches). Viewers: BLink (protein similarities); Graphical Sequence Viewer (annotation viewer and analysis tool); SRA Run Browser (web access). Tools, special services, and standalone software: Entrez Utilities (Entrez API), <ncbi>/books/NBK25501/; Standalone BLAST (BLAST programs + databases), <ncbi>/books/NBK1762/; Cn3D (3D structure viewer), <ncbi>/Structure/CN3D/cn3d.shtml; Genome Workbench (sequence analysis / annotation platform), <ncbi>/tools/gbench/; SRA toolkit (standalone SRA manipulator and client), <ncbi>/Traces/sra/

Today's Topics. Basics of using NCBI BLAST: motivation, statistics, scoring, family of programs. Using the Web Interface. Other Web services: COBALT (protein multiple alignment), Primer BLAST, MOLE-BLAST. Hands-on.

What is BLAST? A widely used sequence similarity search tool. It finds high-scoring local alignments between two sequences (protein or DNA), includes a model of score distributions for random local alignments, and provides statistical significance for alignments.

BLAST Fundamentals. BLAST tells you about non-chance similarities between biological sequences. If similarities are not due to chance, then they must be due to something else: homology or simple identification. All BLAST searches begin with a sequence, protein or nucleotide, either experimentally determined or taken from a database.

What BLAST tells you. Here's my sequence... What is it related to? What does it do? (Homology; function.) Is it already in the database? (Identification: find the matching sequence in the database.) Where is it located or how is it organized? (Annotation problems: comparing sequences, looking for frame shifts.)

BLAST Statistics. The most important statistic is the Expect value (E-value): the expected number of random alignments with a particular score or better. Two examples from the slide: an alignment with 48 thousand expected chance alignments is indistinguishable from chance, while one with 7 × 10^-18 expected chance alignments is not due to chance. The E-value depends directly on the size of the search space (database), so search the smallest database likely to contain the sequence of interest.
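
The dependence of the E-value on search-space size can be sketched with the standard bit-score relation E = m × n × 2^(-S'), where m is the query length, n the database length, and S' the bit score. This is a simplified sketch: it ignores the effective-length corrections BLAST applies, and the lengths and score below are illustrative numbers, not values from the slide.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), with no effective-length correction.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same hit (50 bits, 300-residue query) against two database sizes.
    small_db = expect_value(50.0, 300, 10_000_000)
    large_db = expect_value(50.0, 300, 10_000_000_000)
    print(f"small database: E = {small_db:.2e}")
    print(f"large database: E = {large_db:.2e}")
    # A database 1000x larger makes the same alignment 1000x less significant.
    print(f"ratio: {large_db / small_db:.0f}")
```

The same alignment becomes less significant as the database grows, which is why the slide recommends the smallest database likely to contain the sequence of interest.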

Scoring: Nucleotide. Match = +2; mismatch = -3; gap = -(5 + 4(2)) = -13, that is, a gap existence cost of 5 plus an extension cost of 2 for each of 4 gapped positions.
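
The nucleotide scores above can be applied to a finished alignment as follows. This is a minimal sketch that assumes the alignment is given as two equal-length strings with '-' marking gaps; the function names and the example sequences are made up for illustration.

```python
MATCH, MISMATCH = 2, -3
GAP_EXISTENCE, GAP_EXTENSION = 5, 2


def gap_cost(length: int) -> int:
    """Affine gap score: -(existence + extension * length)."""
    return -(GAP_EXISTENCE + GAP_EXTENSION * length)


def score_alignment(top: str, bottom: str) -> int:
    """Score two equal-length gapped strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score, run, side = 0, 0, None
    for a, b in zip(top, bottom):
        this_side = "top" if a == "-" else "bottom" if b == "-" else None
        if this_side != side and run:
            score += gap_cost(run)  # close the gap run that just ended
            run = 0
        side = this_side
        if this_side is None:
            score += MATCH if a == b else MISMATCH
        else:
            run += 1
    if run:
        score += gap_cost(run)  # gap run reaching the end of the alignment
    return score


if __name__ == "__main__":
    print(gap_cost(4))                                      # -13, as on the slide
    print(score_alignment("ACGTTTTTACGT", "ACGT----ACGT"))  # 8 matches - 13 = 3
```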

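Scoring: Protein. K/K = +5; D/E = +2; Q/F = -3; gap = -(11 + 6(1)) = -17, that is, a gap existence cost of 11 plus an extension cost of 1 for each of 6 gapped positions. Scores come from BLOSUM62, a position-independent matrix: the same substitution gets the same score at all positions, and all positions are treated as equally likely to change.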
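
Position-independent scoring amounts to a symmetric lookup keyed only on the residue pair. The sketch below holds just the three BLOSUM62 entries quoted on the slide (a real search uses the full 20 × 20 matrix), and the helper names are hypothetical.

```python
# Only the three residue pairs shown on the slide; not the full matrix.
BLOSUM62_SUBSET = {("K", "K"): 5, ("D", "E"): 2, ("F", "Q"): -3}


def substitution_score(a: str, b: str) -> int:
    """Look up a pair score; the matrix is symmetric, so order is ignored."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def gap_cost(length: int, existence: int = 11, extension: int = 1) -> int:
    """Affine gap score: -(existence + extension * length)."""
    return -(existence + extension * length)


if __name__ == "__main__":
    pairs = [("K", "K"), ("D", "E"), ("Q", "F")]
    print(sum(substitution_score(a, b) for a, b in pairs))  # 5 + 2 - 3 = 4
    print(gap_cost(6))                                      # -17, as on the slide
```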

BLOSUM62 Protein Scoring Matrix

BLAST Family of Programs

Nucleotide Search Programs. megablast: larger word size; the default nucleotide search program; best for identification and same-species annotation. Discontiguous megablast: cross-species comparisons. blastn: traditional BLAST algorithm; the most sensitive nucleotide search.

Protein Search Programs (position-independent scoring). blastp: protein query against a protein database. Translating searches: useful for unannotated protein-coding regions; use six-frame translations of the query, the database, or both. blastx: translated query. tblastn: translated database. tblastx: translated query and database.
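
The six-frame translation behind the translating searches can be sketched as below, assuming the standard genetic code and an unambiguous DNA string (any codon containing other characters is rendered as 'X'). The function names and the short example sequence are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Frame +1 of the example reads ATG GCC TAA, giving "MA*", where '*' marks a stop codon.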

Protein Domains and Position-Specific Scoring. Position-specific scoring model: based on a multiple alignment. Substitution scores depend on the position in the protein; some positions are more important (less likely to change). More sensitive at identifying distant homologies. Better at identifying structural / functional domains.

Position-Specific Score Matrix (rows: query position and residue; columns: substituted amino acid)

Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
435 K   -1   0   0  -1  -2   3   0   3   0  -2  -2   1  -1  -1  -1  -1  -1  -1  -1  -2
436 E    0   1   0   2  -1   0   2  -1   0  -1  -1   0   0   0  -1   0   0  -1  -1  -1
437 S    0   0  -1   0   1   1   0   1   1   0  -1   0   0   0   2   0  -1  -1   0  -1
438 N   -1   0  -1  -1   1   0  -1   3   3  -1  -1   1  -1   0   0  -1  -1   1   1  -1
439 K   -2   1   1  -1  -2   0  -1  -2  -2  -1  -2   5   1  -2  -2  -1  -1  -2  -2  -1
440 P   -2  -2  -2  -2  -3  -2  -2  -2  -2  -1  -2  -1   0  -3   7  -1  -2  -3  -1  -1
441 A    3  -2   1  -2   0  -1   0   1  -2  -2  -2   0  -1  -2   3   1   0  -3  -3   0
442 M   -3  -4  -4  -4  -3  -4  -4  -5  -4   7   0  -4   1   0  -4  -4  -2  -4  -1   2
443 A    4  -4  -4  -4   0  -4  -4  -3  -4   4  -1  -4  -2  -3  -4  -1  -2  -4  -3   4
444 H   -4  -2  -1  -3  -5  -2  -2  -4  10  -6  -5  -3  -4  -3  -2  -3  -4  -5   0  -5
445 R   -4   8  -3  -4   0  -1  -2  -3  -2  -5  -4   0  -3  -2  -4  -3  -3   0  -4  -5
446 D   -4  -4  -1   8  -6  -2   0  -3  -3  -5  -6  -3  -5  -6  -4  -2  -3  -7  -5  -5
447 I   -4  -5  -6  -6  -3  -4  -5  -6  -5   3   5  -5   1   1  -5  -5  -3  -4  -3   1
448 K    0   0   1  -3  -5  -1  -1  -3  -3  -5  -5   7  -4  -5  -3  -1  -2  -5  -4  -4
449 S    0  -3  -2  -3   0  -2  -2  -3  -3  -4  -4  -2  -4  -5   2   6   2  -5  -4  -4
450 K    0   3   0   1  -5   0   0  -4  -1  -4  -3   4  -3  -2   2   1  -1  -5  -4  -4
451 N   -4  -3   8  -1  -5  -2  -2  -3  -1  -6  -6  -2  -4  -5  -4  -1  -2  -6  -4  -5
452 I   -3  -5  -5  -6   0  -5  -5  -6  -5   6   2  -5   2  -2  -5  -4  -3  -5  -3   3
453 M   -4  -4  -6  -6  -3  -4  -5  -6  -5   0   6  -5   1   0  -5  -4  -3  -4  -3   0
454 V   -3  -3  -5  -6  -3  -4  -5  -6  -5   3   3  -4   2  -2  -5  -4  -3  -5  -3   5
455 K   -2   1   1   4  -5   0  -1  -2   1  -4  -2   4  -3  -2  -3   0  -1  -5  -2  -3
456 N    1   1   3   0  -4  -1   1   0  -3  -4  -4   3  -2  -5  -2   2  -2  -5  -4  -4
457 D   -3  -2   5   5  -1  -1   1  -1   0  -5  -4   0  -2  -5  -1   0  -2  -6  -4  -5
458 L   -3  -1   0  -3   0  -3  -2   3  -4  -2   3   0   1   1  -2  -2  -3   5  -1  -3

Slide annotation: catalytic loop
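
Scoring a peptide against a PSSM means reading one score per position from that position's own row, rather than from a single shared substitution matrix. The sketch below copies rows 444 to 446 (H, R, D) from the matrix above; the function name and the test peptides are illustrative assumptions.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order used on the slide

# Rows 444-446 of the slide's matrix, one list of 20 scores per position.
PSSM_ROWS = {
    444: [-4, -2, -1, -3, -5, -2, -2, -4, 10, -6, -5, -3, -4, -3, -2, -3, -4, -5, 0, -5],
    445: [-4, 8, -3, -4, 0, -1, -2, -3, -2, -5, -4, 0, -3, -2, -4, -3, -3, 0, -4, -5],
    446: [-4, -4, -1, 8, -6, -2, 0, -3, -3, -5, -6, -3, -5, -6, -4, -2, -3, -7, -5, -5],
}


def pssm_score(peptide: str, start: int) -> int:
    """Sum position-specific scores for `peptide` aligned from row `start`."""
    return sum(
        PSSM_ROWS[start + offset][AMINO_ACIDS.index(residue)]
        for offset, residue in enumerate(peptide)
    )


if __name__ == "__main__":
    print(pssm_score("HRD", 444))  # 10 + 8 + 8 = 26
    print(pssm_score("HKD", 444))  # K at position 445 scores 0, total 18
    print(pssm_score("ARD", 444))  # A at position 444 scores -4, total 12
```

The strongly conserved positions carry much larger scores than any single entry quoted from BLOSUM62 on the protein scoring slide, which is what makes position-specific searches more sensitive to distant homologs.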

Position-specific Programs (protein only). Position-Specific Iterative BLAST (PSI-BLAST): automatically generates a position-specific score matrix (PSSM) from an initial set of BLAST alignments. Pattern-Hit Initiated BLAST (PHI-BLAST): focuses the search around a pattern (motif). Domain Enhanced Lookup Time Accelerated BLAST (DELTA-BLAST): uses conserved-domain PSSMs in the first round of the search. Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs. Conserved Domain Database Search: quickly identifies the type of protein and its potential function; runs with all blastp searches at the NCBI; identifies conserved domains in the query.

Query Sequences

Queries. FASTA format, single or multiple. Accessions, single or multiple. Directly from the sequence databases.
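
A multi-record FASTA query is a series of '>' definition lines, each followed by one or more sequence lines. The parser below is a minimal sketch that keys each record on the first word of its definition line; the function name and the two example records are invented for illustration.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {identifier: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited word.
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    example = ">seq1 made-up protein\nMKT\nFLV\n>seq2 made-up nucleotide\nACGT\n"
    for name, sequence in parse_fasta(example).items():
        print(name, len(sequence), sequence)
```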

BLAST 2 (or more) Sequences. Any search page is convertible to BLAST 2 (or more) Sequences. Can search a small custom database. Many who use this really want a global alignment.

Global Alignment Tool. Uses Needleman-Wunsch. Includes all residues of both sequences. Will align unrelated sequences. Provides global statistics: percent identity and percent positives. Example: NP_000468 (ALB) vs. NP_000574 (GC).
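
Needleman-Wunsch global alignment can be sketched as a dynamic-programming table followed by a traceback. The version below uses a simple linear gap penalty and arbitrary match/mismatch scores chosen for illustration; it is not the scoring scheme of the NCBI tool. Percent positives is left out because it needs a substitution matrix.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner so every residue is included.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(top: str, bottom: str) -> float:
    """Identical aligned residues as a percentage of alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(top, bottom))
    return 100.0 * same / len(top)


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
    print(total)
    print(top)
    print(bottom)
    print(f"{percent_identity(top, bottom):.1f}% identity")
```

Because the traceback always runs from corner to corner, the method returns an alignment even for unrelated sequences, which is the caveat the slide raises.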

BLAST Databases

Protein Databases. Services: blastp, blastx. Default database (nr): most comprehensive. Useful subsets: RefSeq, Swiss-Prot, PDB. What's not in nr? US, European and Asian patents; proteins from metagenomes; proteins from Next-Gen assemblies.

Nucleotide Databases. Services: megablast, blastn, tblastn, tblastx.

Nucleotide Databases. The default database (nr/nt) is not comprehensive: it contains traditional GenBank and RefSeq RNA. Useful subsets: RefSeq RNA, 16S rRNA RefSeqs. What is not in nr/nt? The majority of nucleotide data: bulk sequences (EST, GSS, HTGS, STS); RefSeq genomic sequences (Chromosome, RefSeq Genomic, RefSeq Representative Genomes); US, European and Asian patents (pat); Whole Genome Shotgun contigs (WGS; second largest); Transcriptome Shotgun Assemblies (TSA); Next-Gen RNA-Seq and DNA-Seq reads (SRA; largest set).

Limiting Databases. Search the smallest database likely to contain the sequence of interest. Options: organism limit; exclude predicted and uncultured sequences; limit with an Entrez query.

Genome Databases. Comprehensive search for genomic data. Finds the best set (most assembled) of genomic sequences.

Web Program Selection

Nucleotide Programs. [Figure: nucleotide programs arranged on a scale from more to less sensitivity and speed.]

Algorithm Parameters: General. Increase Max target sequences. Decrease Expect threshold: set it to a more stringent value, such as 1e-6 or 0.001. Let the Expect threshold, not Max target sequences, govern the output.

Nucleotide Repeat Filters. Select the matching interspersed repeat filter when working with genomic DNA. On by default on genome BLAST pages. [Figure: results without the repeat filter vs. with the repeat filter.]

Formatting options. Dots for identities. Coding sequence display: highlights frameshifts and sequence changes (nucleotide and protein).

Managing Your Results

The Request ID (RID) is the key. Example: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&RID=HKZG2PPT013 The RID uniquely identifies search settings and results, persists at NCBI for 36 hours, can be viewed through Recent Results or My NCBI, and allows sharing and reformatting of results. Send the RID to blast-help@ncbi.nlm.nih.gov to ask about a search.
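
The result link on this slide is just the BLAST CGI address plus `CMD=Get` and the RID, so a shareable link can be rebuilt from the RID alone. The sketch below only constructs and parses URLs and makes no network request. The optional `FORMAT_TYPE` parameter is an assumption based on the public BLAST URL API and should be checked against its documentation; the helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BLAST_CGI = "http://blast.ncbi.nlm.nih.gov/Blast.cgi"  # address shown on the slide


def results_url(rid: str, format_type: str = "") -> str:
    """Build a link that retrieves the results stored under `rid`."""
    params = {"CMD": "Get", "RID": rid}
    if format_type:
        # Assumed parameter name; verify against the BLAST URL API docs.
        params["FORMAT_TYPE"] = format_type
    return f"{BLAST_CGI}?{urlencode(params)}"


def rid_from_url(url: str) -> str:
    """Extract the RID from a results link."""
    return parse_qs(urlparse(url).query)["RID"][0]


if __name__ == "__main__":
    link = results_url("HKZG2PPT013")
    print(link)
    print(rid_from_url(link))
```

Since results persist for only 36 hours, a saved link of this kind stops working once the RID expires.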

Download Options. Downloads all data for multiple queries in a single file. XML / XML2: easiest to parse with a script and/or redisplay. Hit table: compatible with Excel and other spreadsheet programs. Search strategies: can be used again on the web or in standalone BLAST.
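
A downloaded hit table can also be filtered with a short script. The sketch below assumes the tab-delimited text layout with the twelve standard tabular columns and '#' comment lines; the column names follow the standalone BLAST+ tabular format, and the two example rows are invented for illustration rather than taken from a real search.

```python
import csv

COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_hit_table(text: str, max_evalue: float = 1e-6) -> list:
    """Return hits whose E-value is at or below `max_evalue`."""
    rows = (
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    )
    hits = []
    for fields in csv.reader(rows, delimiter="\t"):
        hit = dict(zip(COLUMNS, fields))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    # Invented rows: one strong hit and one weak hit.
    strong = ["query1", "subjA", "98.5", "200", "3", "0", "1", "200", "11", "210", "2e-50", "180"]
    weak = ["query1", "subjB", "35.0", "60", "39", "2", "5", "64", "100", "159", "4.2", "25.1"]
    table = "# Fields: made-up example\n" + "\t".join(strong) + "\n" + "\t".join(weak) + "\n"
    for hit in read_hit_table(table):
        print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

Filtering on the E-value rather than on a fixed number of rows mirrors the advice to let the Expect threshold govern the output.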

Specialized BLAST Services

Nucleotide Services. Primer BLAST: primer designer / specificity checker; uses Primer3 primer design; uses RefSeq annotation (exon boundaries, splice variants, SNPs). MOLE-BLAST: helps identify sources of 16S and other targeted sequences; runs BLAST followed by a global multiple alignment; clusters queries plus the most similar database sequences; identifies taxonomic units (neighbors); labels database sequences from type material for accurate ID.

Protein Services. COBALT (Constraint-based Alignment Tool): protein global multiple alignment tool; uses conserved domains to guide the alignment; available as an extension to a BLAST search. SmartBLAST: rapid protein identification tool; uses a fast k-mer search; identifies the closest match in a reference organism database; produces a multiple alignment and protein tree; prototype for on-the-fly protein similarity (BLink).

BLAST Help. Help desk: blast-help@ncbi.nlm.nih.gov

More Help Links. Help Manual: <ncbi>/books/NBK3831/ ; Learn: <ncbi>/home/learn.shtml ; Factsheets: <ftp>/pub/factsheets/ ; NCBI YouTube: <youtube>/ncbinlm ; NCBI Helpdesks: General, info@ncbi.nlm.nih.gov ; BLAST, blast-help@ncbi.nlm.nih.gov

Web Demonstrations. Basic BLAST: blastp, creatine kinases; COBALT extension. Genome BLAST: blastn, tomato ETR2; potato genome BLAST; formatting options; genome context. SRA BLAST: potato RNA-Seq. Primer BLAST: BRCA1 exon primers. Microbial Genomes BLAST: chicken gut 16S. MOLE-BLAST: clustering bovine rumen 16S.