NCBI Minicourses BLAST Quick Start

Presentation transcript:

NCBI Minicourses BLAST Quick Start

Algorithm Basics: Introduction, Words & extensions
Search Basics: Programs, Databases, Submit a search, Interpret the results, BLAST options, Format options, Examples

Basic Local Alignment Search Tool
Compare protein or nucleic acid sequences to protein or nucleic acid databases
– in NCBI databases
– in local databases (standalone BLAST)
– to a single protein or nucleotide sequence (BLAST 2 Sequences, or pairwise BLAST)

BLAST Web Searches, 2007

Basic Local Alignment Search Tool
– local alignments; isolated regions of similarity
– fast and sensitive
– breaks the query sequence into “words”
– word matches to database sequences are extended in both directions

Global vs Local Alignment
[Figure: Seq 1 and Seq 2 shown once as a global alignment and once as a local alignment]

Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16aa)
Seq2: HEWASHEREBUTNOWISHERE (21aa)
Global:
Seq1:  1 W--HEREISWALTERNOW 16
         W  HERE
Seq2:  1 HEWASHEREBUTNOWISHERE 21
Local:
Seq1:  1 W--HERE  5
         W  HERE
Seq2:  3 WASHERE  9
Seq1:  1 W--HERE  5
         W  HERE
Seq2: 15 WISHERE 21
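The difference between the two alignment modes can be illustrated with a minimal dynamic-programming sketch. The scoring values (match +1, mismatch -1, linear gap -1) and the function names are assumptions chosen for illustration; they are not the scores BLAST uses, and BLAST itself relies on a heuristic rather than on full dynamic programming.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score: best-scoring pair of sub-segments."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print("global:", global_score(seq1, seq2))
print("local: ", local_score(seq1, seq2))
```

Because a local alignment is free to ignore poorly matching flanks, its score is never lower than the global score under the same scoring scheme.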

How BLAST Works
1. Make lookup table of “words” for query
2. Scan database for hits
3. Extend alignment in both directions
–Ungapped extensions of hits (initial HSPs)
–Gapped extensions (no traceback)
–Gapped extensions (traceback: alignment details)

Nucleotide Words
Query: ATGCTGCTAGTCGATGACGTAGCTA
11-mer words: ATGCTGCTAGT, TGCTGCTAGTC, GCTGCTAGTCG, ...
Make a lookup table based on the word size.
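The lookup-table and scan steps for nucleotide words can be sketched as follows. This is an illustrative sketch only: the function names and the short subject sequence are hypothetical, and a real BLAST implementation uses a far more compact table.

```python
from collections import defaultdict


def build_lookup(query, word_size=11):
    """Map every overlapping word of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table


def scan(subject, table, word_size=11):
    """Return (query_pos, subject_pos) for every exact word hit in the subject."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return hits


query = "ATGCTGCTAGTCGATGACGTAGCTA"      # query from the slide
table = build_lookup(query)
subject = "CCGCTGCTAGTCGCC"              # hypothetical database sequence
print(scan(subject, table))
```

Each hit is a seed; the extension steps start from these coordinates.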

Protein Words
Query: AIEKCYTGCTLAQEADDTA
Words: AIE, IEK, EKC, KCY, CYT, …
Neighborhood words: LEK, IDK, IQK, IER, IDR, etc.
Lookup table, including neighborhood words, is based on word size, score matrix, and threshold.

Scoring Systems - Proteins (BLOSUM62)
[Figure: BLOSUM62 substitution matrix over residues A R N D C Q E G H I L K M F P S T W Y V X]
Query word IEK: keep LEK? (threshold = 11)
IEK vs IEK: I/I + E/E + K/K = 14
IEK vs LEK: L/I + E/E + K/K = 12, which is above the threshold, so LEK is kept
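The neighborhood-word test can be sketched with a small hand-copied subset of BLOSUM62. Only the residue pairs needed for the IEK example are included, and the helper names are hypothetical; a full implementation would load the complete matrix.

```python
# Subset of BLOSUM62 (assumption: only the pairs needed for this example).
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("L", "L"): 4, ("I", "L"): 2,
    ("E", "E"): 5, ("D", "E"): 2, ("E", "Q"): 2,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]


def word_score(query_word, candidate):
    """Sum the substitution scores position by position."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def is_neighbor(query_word, candidate, threshold=11):
    """A candidate is kept when its score reaches the threshold T."""
    return word_score(query_word, candidate) >= threshold


for candidate in ("IEK", "LEK", "IDK", "IQK", "IER"):
    print(candidate, word_score("IEK", candidate), is_neighbor("IEK", candidate))
```

Raising the threshold shrinks the neighborhood and speeds up the search at the cost of sensitivity.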


BLASTP Summary
Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
Example query words: YLS, HFL
Neighborhood words (neighborhood score threshold T (-f) = 11):
YLS 15, YLT 12, YVS 12, YIT 10, etc.
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc.
Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
Drop-off score = highest score - current score
-X: X dropoff value for gapped alignment (in bits): blastn 30, megablast 20, tblastx 0, all others 15
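The drop-off rule can be illustrated with a simplified ungapped extension. This sketch makes several assumptions: it extends to the right only, it uses nucleotide scores of +1 and -3, and it measures the drop-off in raw score units rather than in bits. The function name is hypothetical.

```python
def xdrop_extend(query, subject, q_pos, s_pos, x_drop, match=1, mismatch=-3):
    """Extend a hit to the right without gaps.

    Stops once the running score has fallen more than x_drop below the
    best score seen so far. Returns (best_score, length_at_best_score).
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_pos + i < len(query) and s_pos + i < len(subject):
        score += match if query[q_pos + i] == subject[s_pos + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # drop-off exceeded: give up on this extension
    return best, best_len


# Eight matches followed by mismatches: the extension stops after the run.
print(xdrop_extend("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=5))
# A single mismatch is bridged only when the allowed drop-off is large enough.
print(xdrop_extend("ACGTTACGT", "ACGTAACGT", 0, 0, x_drop=5))
print(xdrop_extend("ACGTTACGT", "ACGTAACGT", 0, 0, x_drop=2))
```

A larger X value lets the extension pass through poorly matching stretches, which finds longer alignments but costs more time.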

BLASTP Summary
Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
Example query words: YLS, HFL
Neighborhood words (neighborhood score threshold T (-f) = 11):
YLS 15, YLT 12, YVS 12, YIT 10, etc.
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc.
High-scoring pair (HSP):
Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
Gapped extension with trace back, final HSP:
Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337

Search Basics: Programs, Databases, Submit a search, Interpret the results, BLAST options, Format options, Examples

BLAST Programs: What is your goal?
Program: query X database
blastn: nucleotide X nucleotide
blastp: protein X protein
6 frame, translated nucleotide searches:
blastx: nucleotide X protein
tblastx: nucleotide X nucleotide
tblastn: protein X nucleotide
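The 6-frame translation behind blastx, tblastn, and tblastx can be sketched as below. The sketch assumes the standard genetic code and an unambiguous A/C/G/T sequence; the function names and the toy sequence are hypothetical, and this is not how BLAST implements the step internally.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return the three forward and three reverse-strand translations."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations is then compared at the protein level, which is why translated searches cost more than blastn or blastp.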

Word Size Trade-off: sensitivity vs speed
Too fast for you?

More BLAST Programs
MegaBLAST
– batch nucleotide queries
– very similar sequences
Discontiguous MegaBLAST
– batch nucleotide queries
– divergent sequences
Word Size matches out of 18

Templates for Discontiguous Words
W = word size (# matches in template); t = template length
Templates are shown for W = 11 and W = 12, each with t = 16, 18, and 21, in coding and non-coding versions.
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5

Search Basics: Programs, Databases, Submit a search, Interpret the results, BLAST options, Format options, Examples

Nucleotide BLAST Databases
nr (nt)
–Traditional GenBank Divisions
–NM_ and XM_ RefSeqs
refseq_rna
–NM_, XM_, NR_
refseq_genomic
–NC_, NT_, NG_
est
–EST Division
htgs
–HTG Division
dbsts
–STS Division
chromosome
–NC genomic records
gss
–GSS Division
pat
–PAT Division
wgs
–wgs entries from traditional divisions
pdb
–Nucleotide sequences from structures
env_nt
–environmental samples

New Nucleotide Databases: refseq_genomic + refseq_rna

New Nucleotide Databases

Protein BLAST Databases
Protein nr
– traditional GenBank records
– refseq = NP_, XP_
– swissprot
– pdb
– pat
– env_nr
nr = nr

BLAST Databases: Genome-specific

Genome-specific: Map Viewer

Basic Local Alignment Search Tool: What’s New? Save your searches

Search Basics: Programs, Databases, Submit a search, Interpret the results, BLAST options, Format options, Examples

Setting up a BLAST Search: Identifier or sequence

BLAST Output

Taxonomy Reports: Tax BLAST Report

Taxonomy Reports: Tax BLAST Report

New Output View

Graphic Overview

New Output View
– Transcript & genomic hits separated
– Results can be sorted
– Pseudogene, chr 9; functional gene, chr 1

Sorting Results: resorted by Total score; functional gene now first

Sorting Hits: by Score
Sort by Score: longest exon usually first

Sorting Hits: by Query Start
Sort by Query start: proper exon order

Search Basics (continued): Programs, Databases, Submit a search, Interpret the results, BLAST options, Format options, Examples

Where do Scores Come From?

Scoring Matrices
Nucleotide search: identity matrix
[Figure: 4 x 4 identity matrix over A, T, C, G]
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19-9* = 10*
* ignores gap costs
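The arithmetic on this slide can be reproduced with a short sketch. It assumes a reward of +1 per identical column and a penalty of -3 for every other column, and, like the slide, it charges the gap column as an ordinary mismatch instead of applying separate gap costs. The function name is hypothetical.

```python
def identity_raw_score(query_row, subject_row, reward=1, penalty=-3):
    """Score two aligned rows of equal length with an identity matrix.

    Gap columns ('-') are charged as plain mismatches here, so real gap
    opening and extension costs are ignored.
    """
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for q, s in zip(query_row, subject_row)
                  if q == s and q != "-")
    others = len(query_row) - matches
    return matches, others, matches * reward + others * penalty


matches, others, score = identity_raw_score(
    "CAGGTAGCAAGCTTGCATGTCA",
    "CACGTAGCAAGCTTG-GTGTCA",
)
print(matches, others, score)
```

With these assumed values the 19 identical columns contribute 19 and the 3 remaining columns contribute -9, giving the raw score of 10 shown above.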

Scoring Matrices
Protein search: substitution matrix
– Percent Accepted Mutation (PAM)
– Blocks Substitution Matrix (BLOSUM)

PAM Matrices
– global alignments of closely related proteins
– PAM1 calculated from sequences with no more than 1% divergence
– Other PAM matrices are extrapolated from PAM1
– Examples: PAM30 and PAM70

BLOSUM Matrices
– local alignments
– all matrices based on observed alignments; not extrapolated
– BLOSUM62 calculated from sequences with no more than 62% identity
– Examples: BLOSUM45, BLOSUM62 and BLOSUM80
– BLOSUM62: very good for detecting weak protein similarities
– BLOSUM62 is the default matrix in BLAST

BLOSUM62
[Figure: BLOSUM62 substitution matrix over residues A R N D C Q E G H I L K M F P S T W Y V X]
Negative for less likely substitutions (highlighted: D/F)
Positive for more likely substitutions (highlighted: Y/F)

BLOSUM62
[Figure: BLOSUM62 substitution matrix over residues A R N D C Q E G H I L K M F P S T W Y V X]
J (leucine or isoleucine) and O (pyrrolysine)

Substitution Matrices: A Provisional Table
Query Length | Substitution Matrix
<35 | PAM30
35-50 | PAM70
50-85 | BLOSUM80
>85 | BLOSUM62

Where do Expect Values Come From?

Local Alignment Statistics
Expect Value E = number of database hits you expect to find by chance with score ≥ S
$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$
K = scale for search space
λ = scale for scoring system
S' = bit score: $S' = (\lambda S - \ln K)/\ln 2$
m = query length
n = database length
E is dependent on m x n (search space)
More info: The Statistics of Sequence Similarity Scores
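The two forms of the Expect value formula can be checked against each other numerically. The values used for λ, K, the raw score, and the lengths below are illustrative assumptions, not parameters taken from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


lam, k = 0.267, 0.041            # assumed statistical parameters
m, n = 250, 1_000_000_000        # assumed query and database lengths
raw = 100                        # assumed raw score

bits = bit_score(raw, lam, k)
print(f"bit score = {bits:.1f}")
print(f"E (raw form) = {expect_from_raw(raw, m, n, lam, k):.3g}")
print(f"E (bit form) = {expect_from_bits(bits, m, n):.3g}")
```

Because E scales with m x n, the same alignment receives a larger Expect value in a larger database, and a short query can only reach a modest score and so tends to receive a high one.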

Short query: low score, high Expect
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19-9 = 10
E is dependent on m x n (search space)
More info: The Statistics of Sequence Similarity Scores

Search Basics: Programs, Databases, Submit a search, Interpret the results, BLAST options, Format options, Examples

Advanced Options
Limit to Organism
Example Entrez Queries
– all[Filter] NOT mammalia[organism]
– chimpanzee[organism]
– srcdb_refseq_reviewed[properties]
Nucleotide only:
– biomol_mrna[properties]
– biomol_genomic[properties]
Other Advanced
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments

For short queries: increase e and lower W
Other Advanced Options
-e expect value
-W word size
-v descriptions
-b alignments
Example: – e –W 7 –v 2000 –b 1500

BLASTP: “…short, nearly exact matches”

Search Basics: Programs, Databases, Submit a search, Interpret the results, BLAST options, Format options, Examples

‘New’ Formatter: Select lower case; Select red

BLAST Output: Alignments & Filter
– low complexity sequence filtered
– positives, from score matrix

Alignment View Options

Alignment View: Flat Query-Anchored with Identities

Alignment View: Hit Table
# Fields: query id, subject ids, % identity, % positives, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
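A hit table with these thirteen fields is easy to post-process. The sketch below assumes tab-separated values with comment lines starting with "#"; the sample record and the identifiers in it are invented for illustration and do not come from a real search.

```python
FIELDS = [
    "query id", "subject ids", "% identity", "% positives",
    "alignment length", "mismatches", "gap opens",
    "q. start", "q. end", "s. start", "s. end", "evalue", "bit score",
]
INT_FIELDS = {"alignment length", "mismatches", "gap opens",
              "q. start", "q. end", "s. start", "s. end"}
FLOAT_FIELDS = {"% identity", "% positives", "evalue", "bit score"}


def parse_hit_table(text):
    """Turn hit-table text into a list of dicts keyed by field name."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
        hit = dict(zip(FIELDS, values))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


# Hypothetical record, for illustration only.
sample = (
    "# Fields: query id, subject ids, ...\n"
    "query1\tsubjectA\t44.41\t61.00\t331\t176\t8\t1\t325\t10\t338\t7e-77\t290\n"
)
hits = parse_hit_table(sample)
best = min(hits, key=lambda h: h["evalue"])
print(best["subject ids"], best["evalue"], best["bit score"])
```

Keeping numeric fields as numbers makes it simple to filter or sort hits by Expect value or bit score.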

Search Basics: Programs, Databases, Submit a search, Interpret the results, BLAST options, Format options, Examples

Examples
– Protein searches are more sensitive than nucleotide searches: redundancy of the genetic code
– blastn, megablast: best when searching within the same organism

BLASTN vs BLASTP
Compare YadC from E. coli K12 and O157 using BLAST 2 Sequences
[Figure: BLASTN and BLASTP results]

An alignment BLAST cannot make:
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Reason: no contiguous exact match of 7 bp.
BLAST is a shortcut...
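Whether two sequences can seed a word-based search at all depends on the longest exact run they share. A minimal sketch of that check follows; the function names and the short test sequences are hypothetical.

```python
def longest_shared_word(a, b):
    """Length of the longest exact substring common to a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the exact run that ended at the previous pair.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def can_seed(a, b, word_size):
    """True when at least one exact word of the given size is shared."""
    return longest_shared_word(a, b) >= word_size


a = "ACGTACGT"
b = "TTACGTAA"
print(longest_shared_word(a, b))
print(can_seed(a, b, word_size=7), can_seed(a, b, word_size=5))
```

When no shared word reaches the word size, no hit is generated and nothing is extended, however similar the sequences are overall.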

An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3

BLASTN vs TBLASTX
Anopheles gambiae mitochondrion, complete genome. ACCESSION NC_ GI:
Homo sapiens mitochondrion, complete genome. ACCESSION NC_ GI:
[Figure: blastn and tblastx comparisons of Anopheles gambiae against Homo sapiens]

Hands-On

Position-Specific Score Matrix
DAF-1, Serine/Threonine protein kinases, catalytic loop
174 PSSM scores 5 4

Position-Specific Score Matrix
[Figure: PSSM with columns A R N D C Q E G H I L K M F P S T W Y V and rows starting at position 435 for residues K E S N K P A M A H R D I K S K N I M V K N D L; catalytic loop marked]