A Field Guide part 2 National Center for Biotechnology Information

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.
BLAST Sequence alignment, E-value & Extreme value distribution.
NCBI FieldGuide National Center for Biotechnology Information A Field Guide to GenBank and NCBI’s Molecular Biology Resources August 30, 2005 University.
NCBI Minicourses BLAST Quick Start
NCBI Minicourses BLAST Quick Start
Sequence Analysis MUPGRET June workshops. Today What can you do with the sequence? What can you do with the ESTs? The case of SNP and Indel.
Copyright OpenHelix. No use or reproduction without express written consent1 Organization of genomic data… Genome backbone: base position number sequence.
Similar Sequence Similar Function Charles Yan Spring 2006.
BLAST.
Chapter 2 Sequence databases A list of the databases’ uniform resource locators (URLs) discussed in this section is in Box 2.1.
Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman.
Sequence alignment, E-value & Extreme value distribution
Access to sequences: GenBank – a place to start and then some more... Links: embl nucleotide archive
Sequence Alignment Lakshmanan Iyer, Ph. D.. The Building Blocks… ATGC VLMFNQEDHKRCSTPYW.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
An Introduction to Bioinformatics
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
NCBI FieldGuide A Minimal Guide to NCBI Nucleotide Resources.
BLAST : Basic local alignment search tool B L A S T !
NCBI FieldGuide NCBI Molecular Biology Resources March 2007 Peter Cooper Using NCBI BLAST.
NCBI FieldGuide A Field Guide part 2 August 30, 2005 University of Colorado Health Sciences Center.
NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)
NCBI FieldGuide NCBI Molecular Biology Resources January 2008 Using Entrez.
NCBI FieldGuide A Field Guide part 2 Genome resources Sequence similarity Apr, 2007 Shandong University.
NCBI FieldGuide NCBI Molecular Biology Resources Part 2 November 2008 Peter Cooper.
NCBI FieldGuide MapViewer Genome Resources and Sequence SimilarityLocusLink UniGene Homologene Basic Local Alignment Search Tool Gene database.
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
Copyright OpenHelix. No use or reproduction without express written consent1.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
What is BLAST? BLAST® (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases.
NCBI FieldGuide NCBI Molecular Biology Resources January 2008 Peter Cooper Using NCBI BLAST.
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
NCBI resources II: web-based tools and ftp resources Yanbin Yin Fall 2014 Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/ 1.
NCBI FieldGuide NCBI Molecular Biology Resources March 2007 Using Entrez.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
NCBI Literature Databases: PubMed
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Database search. Overview : 1. FastA : is suitable for protein sequence searching 2. BLAST : is suitable for DNA, RNA, protein sequence searching.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
Step 3: Tools Database Searching
What is BLAST? Basic BLAST search What is BLAST?
Practice -- BLAST search in your own computer 1.Download data file from the course web page, or Ensemble. Save in the blast\dbs folder. 2.Start a CMD window,
NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 (post intermission) September 30, 2004 ICGEB.
Sequence Similarity The bioinformatics for molecular biologists lecture series.
What is sequencing? Video: WlxM (Illumina video) WlxM.
Lab 3.2: Database Similarity Searching “The BLAST Buffet” Stephanie Minnema University of Calgary.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
What is BLAST? Basic BLAST search What is BLAST?
A Practical Guide to NCBI BLAST
NCBI Molecular Biology Resources
Lecture 3.1 BLAST.
Blast Basic Local Alignment Search Tool
Basics of BLAST Basic BLAST Search - What is BLAST?
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Identifying templates for protein modeling:
BLAST.
Sequence alignment, Part 2
Basic Local Alignment Search Tool
NCBI Molecular Biology Resources
Basic Local Alignment Search Tool (BLAST)
Genome of the week Bacillus subtilis Gram-positive soil bacterium
Basic Local Alignment Search Tool
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

A Field Guide part 2 National Center for Biotechnology Information February 14, 2006 UT-Health Science Center

GenBank Records Header The Flatfile Format Feature Table Sequence

A Typical GenBank Record LOCUS NM_019570 4279 bp mRNA linear INV 28-OCT-2004 DEFINITION Mus musculus REV1-like(S. cerevisiae)(Rev1l),mRNA ACCESSION NM_019570 VERSION NM_019570.3 GI:50811869 KEYWORDS . A Typical GenBank Record = Title

GenBank Record: Feature Table

GenBank Record: Feature Table, con’t. GenPept identifier

GenBank Record: sequence skip

Indexing for Nucleotide UID 59958365 Field Indexed Terms [primary accession] NM_001012399 [title] Bos taurus hemochromatosis (hfe), mRNA. [organism] Bos taurus [sequence length] 1168 [modification date] 2005/02/19 [properties] biomol mrna gbdiv mam srcdb refseq [accn] [orgn] [mdat] [prop]

Global Entrez Search: HFE

Entrez Nucleotide: HFE 137 records [Title] Not HFE

Smarter Query hfe[title] AND human[orgn] 42 records Curated HFE splice variants (11 total)

hfe[title] AND human[orgn] (con’t) Primary data

Preview/Index Gateway to Advanced Searches

Preview/Index

Preview/Index: Properties, srcdb

Preview/Index: Properties, srcdb …AND srcdb refseq[Properties]

Preview/Index: Properties, srcdb …AND srcdb ddbj/embl/genbank[Properties]

Database Queries #1 hfe 137 #2 hfe[title] AND human[orgn] 42 #3 #2 AND srcdb refseq[prop] 11 #4 #2 AND srcdb ddbj/embl/genbank[prop] 31 #5 #4 AND gbdiv pri[prop] 29 #4 #4 AND gbdiv est[prop] 2 Primate division gbdiv pri[prop] EST division gbdiv est[prop]

Molecule Queries #1 hfe 116 #2 hfe[title] AND human[orgn] 42 #3 #2 AND biomol mrna[prop] 29 #4 #2 AND biomol genomic[prop] 13 Genomic DNA biomol genomic[prop] cDNA biomol mrna[prop]

srcdb refseq reviewed[prop] AND transcript[title] AND variant[title] More Queries… Fields are database-specific Entrez Nucleotide Reviewed RefSeqs with transcript variants: srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]

More Queries… Fields are database-specific Entrez Nucleotide Reviewed RefSeqs with transcript variants: srcdb refseq reviewed[prop] AND transcript[title] AND variant[title] Topoisomerase genes from Archaea: topoisomerase[gene name] AND archaea[organism] Entrez Gene Genes on human chromosome 2 with OMIM links 2[chromosome] AND human[organism] AND “gene omim”[filter] Membrane proteins linked to cancer: “integral to plasma membrane”[gene ontology] AND cancer[dis]

Other Entrez Databases UniGene: rat clusters that have at least one mRNA rat[organism] NOT 0[mrna count] SNP: uniquely mapped microsatellites on human chr2 microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome]) AND human[orgn] UniSTS: markers on the Genethon map of human chromosome 12 Genethon[Map Name] AND human[organism] AND 12[chromosome] Structure: structures of bacterial kinases with resolutions below 2 Å bacteria[organism] AND kinase AND 000.00:002.00[resolution]

Genome Resources Genomic Biology Genomic Biology UniGene E-PCR Map Viewer Trace Archive

Genomic Biology

Gen Biol: Gen Resources

Map Viewer – Genome Annotation Updates

Gen Biol: Gen Resources

Genome Projects: microb

Genome Projects: microb 13 Eukaryotic Genome Sequencing Projects Selected: Complete – 0, Assembly – 2, In Progress - 11

Genome Projects: microb 13 Eukaryotic Genome Sequencing Projects Selected: Complete – 0, Assembly – 2, In Progress - 11

Gen Biol: Gen Resources

Gen Biol: Gen Resources

Gen Biol: Gen Resources

Gen Biol: Gen Resources

Gen Biol: Gen Resources

Genome Resources UniGene Genomic Biology E-PCR Map Viewer Trace Archive

Gene-oriented clusters of expressed sequences UniGene Gene-oriented clusters of expressed sequences Automatic clustering using MegaBlast Each cluster represents a unique gene Informed by genome hits Information on tissue types and map locations Useful for gene discovery and selection of mapping reagents

A Cluster of ESTs query 5’ EST hits 3’ EST hits

UniGene Collections

UniGene Collections Species UniGene

UniGene Hs build 188

UniGene Cluster Hs.95351 Lipase, hormone-sensitive (LIPE)

UniGene Cluster Hs.95351

UniGene Cluster Hs.95351: expression

UniGene Cluster Hs.95351: seqs

Get Sequences web page ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

Genome Resources E-PCR Genomic Biology UniGene Map Viewer Trace Archive

E-PCR Genomic sequence here

Options

Results

reverse e-pcr

reverse e-pcr

reverse e-pcr

LY6G6D: lymphocyte antigen 6 complex, locus G6D reverse e-pcr Gene STS LY6G6D: lymphocyte antigen 6 complex, locus G6D

Genome Resources Map Viewer Genomic Biology UniGene E-PCR Trace Archive

List View

Human MapViewer adar

MapViewer: Human ADAR

3’ UTR 5’ UTR MV Hs ADAR

Maps & Options Maps & Options --Sequence maps-- --Cytogenetic maps-- Ab initio Assembly Repeats BES_Clone Clone NCI_Clone Contig Component CpG island dbSNP haplotype Fosmid GenBank_DNA Gene Phenotype SAGE_Tag STS TCAG_RNA Transcript (RNA) Hs_UniGene Hs_EST --Cytogenetic maps-- Ideogram FISH Clone Gene_Cytogenetic Mitelman Breakpoint Morbid/Disease --Genetic Maps-- deCODE Genethon Marshfield --RH maps-- GeneMap99-G3 GeneMap99-GB4 NCBI RH Standford-G3 TNG Whitehead-RH Whitehead-YAC Mm_UniGene Mm_EST Rn_UniGene Rn_EST Ssc_UniGene Ssc_EST Bt_UniGene Bt_EST Gga_UniGene Gga_EST Variation = SNP

Component MapViewer Gene UniGene Repeats Customize the annotation

Phenotype Variation Gene Customize the annotation

Maps & Options Maps & Options DR52/53 = alternate MHC haplotypes on chr6; hsc_tcag = alternate chr7 assembly.

Chimp ADAR Human ADAR Mouse ADAR Customize the annotation

Genome Resources Trace Archive Genomic Biology UniGene E-PCR Map Viewer Trace Archive

Trace Archive Page

Ciona savignyi Traces

Trace Archive BLAST Page Potential access to sequences NOT yet in GenBank

Basic Local Alignment Search Tool

BLAST Web Searches, 2005 200,000

Precomputed BLAST Services Nucleotide or protein: Related Sequences BLAST link: BLink Transcript clusters: UniGene Protein homologs: HomoloGene

Link to Related Sequences

Related Sequences Most similar Least similar

BLink (BLAST Link)

BLink Output Best hits 3D structures CDD-Search

Why Is BLAST So Popular? Fast Local alignments - heuristic approach based on Smith Waterman Local alignments Statistical significance - Expect value Versatile - blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast - www, standalone, and network clients

Global vs Local Alignment Seq 1 Seq 2 Global alignment Seq 1 Seq 2 Local alignment

Global vs Local Alignment Seq1: WHEREISWALTERNOW (16aa) Seq2: HEWASHEREBUTNOWISHERE (21aa) Global Seq1: 1 W--HEREISWALTERNOW 16 W HERE Seq2: 1 HEWASHEREBUTNOWISHERE 21 Local Seq1: 1 W--HERE 5 Seq1: 1 W--HERE 5 W HERE W HERE Seq2: 3 WASHERE 9 Seq2: 15 WISHERE 21

How BLAST Works Make lookup table of “words” for query Scan database for hits Extend alignment both directions Ungapped extensions of hits (initial HSPs) Gapped extensions (no traceback) Gapped extensions (traceback - alignment details) -X X dropoff value for gapped alignment (in bits) (zero invokes default behavior) blastn 30, megablast 20, tblastx 0, all others 15 [Integer] default = 0 =X2 (in output) -y X dropoff value for ungapped extensions in bits (0.0 invokes default behavior) blastn 20, megablast 10, all others 7 [Real] default = 0.0 =X1 (in output) -Z X dropoff value for final gapped alignment in bits (0.0 invokes default behavior) blastn/megablast 50, tblastx 0, all others 25 [Integer] =X3 (in output)

Protein Words GTQITVEDLFYNIATRRKALKN GTQ TQI QIT ITV TVE VED EDL DLF Query: Word size = 3 (default) Word size can only be 2 or 3 GTQ TQI QIT ITV TVE VED EDL DLF ... Make a lookup table of words Neighborhood Words VTV, LTV, VSV, etc. VTV 12 LTV 11 VSV 8 Neighborhood score threshold

Highest score – current score BLASTP Summary Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV… example query words HFL 18 HFV 15 HFS 14 HWL 13 NFL 13 DFL 12 HWV 10 etc … YLS 15 YLT 12 YVS 12 YIT 10 Neighborhood words Neighborhood score threshold T (-f) =11 YLS HFL Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333 Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47 +E YA YL K F+ L +SP+ +DVNVHP+K V +++ I Drop-off score = Highest score – current score -X X dropoff value for gapped alignment (in bits) blastn 30, megablast 20, tblastx 0, all others 15

BLASTP Summary High-scoring pair (HSP) Final HSP example query words Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV… example query words HFL 18 HFV 15 HFS 14 HWL 13 NFL 13 DFL 12 HWV 10 etc … YLS 15 YLT 12 YVS 12 YIT 10 Neighborhood words Neighborhood score threshold T (-f) =11 YLS HFL Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333 Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47 +E YA YL K F+ L +SP+ +DVNVHP+K V +++ I High-scoring pair (HSP) Gapped extension with trace back Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50 +E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I + + Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337 Final HSP

Scoring Systems - Nucleotides Identity matrix A G C T A +1 –3 –3 -3 G –3 +1 –3 -3 C –3 –3 +1 -3 T –3 –3 –3 +1 [ -r 1 -q -3 ] The default values for M and N, respectively 5 and -4, having a ratio of 1.25, correspond to about 47 nucleic acid PAMs, or about 58 amino acid PAMs; an M:N ratio of 1 corresponds to 30 nucleic acid PAMs or 38 amino acid PAMs. At higher than about 40 nucleic acid PAMs, or 50 amino acid PAMs, better sensitivity at detecting similarities between coding regions is expected by performing comparisons at the amino acid level (States et al., 1991). CAGGTAGCAAGCTTGCATGTCA || |||||||||||| ||||| raw score = 19-9 = 10 CACGTAGCAAGCTTG-GTGTCA

Scoring Systems - Proteins Position Independent Matrices PAM Matrices (Percent Accepted Mutation) Derived from observation; small dataset of alignments Implicit model of evolution All calculated from PAM1 PAM250 widely used BLOSUM Matrices (BLOck SUbstitution Matrices) Derived from observation; large dataset of highly conserved blocks Each matrix derived separately from blocks with a defined percent identity cutoff BLOSUM62 - default matrix for BLAST Position Specific Score Matrices (PSSMs) PSI- and RPS-BLAST

BLOSUM62 Negative for less likely substitutions D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 A R N D C Q E G H I L K M F P S T W Y V X D F Negative for less likely substitutions Y F Positive for more likely substitutions

Position-Specific Score Matrix Serine/Threonine protein kinases catalytic loop 1 7 4 PSSM scores 5 DAF-1 abnormal DAuer Formation DAF-1, cell-surface receptor daf-1 [Caenorhabditis elegans] gi|25145039|ref|NP_499868.2

Position-Specific Score Matrix A R N D C Q E G H I L K M F P S T W Y V 435 K -1 0 0 -1 -2 3 0 3 0 -2 -2 1 -1 -1 -1 -1 -1 -1 -1 -2 436 E 0 1 0 2 -1 0 2 -1 0 -1 -1 0 0 0 -1 0 0 -1 -1 -1 437 S 0 0 -1 0 1 1 0 1 1 0 -1 0 0 0 2 0 -1 -1 0 -1 438 N -1 0 -1 -1 1 0 -1 3 3 -1 -1 1 -1 0 0 -1 -1 1 1 -1 439 K -2 1 1 -1 -2 0 -1 -2 -2 -1 -2 5 1 -2 -2 -1 -1 -2 -2 -1 440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1 0 -3 7 -1 -2 -3 -1 -1 441 A 3 -2 1 -2 0 -1 0 1 -2 -2 -2 0 -1 -2 3 1 0 -3 -3 0 442 M -3 -4 -4 -4 -3 -4 -4 -5 -4 7 0 -4 1 0 -4 -4 -2 -4 -1 2 443 A 4 -4 -4 -4 0 -4 -4 -3 -4 4 -1 -4 -2 -3 -4 -1 -2 -4 -3 4 444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5 0 -5 445 R -4 8 -3 -4 0 -1 -2 -3 -2 -5 -4 0 -3 -2 -4 -3 -3 0 -4 -5 446 D -4 -4 -1 8 -6 -2 0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5 447 I -4 -5 -6 -6 -3 -4 -5 -6 -5 3 5 -5 1 1 -5 -5 -3 -4 -3 1 448 K 0 0 1 -3 -5 -1 -1 -3 -3 -5 -5 7 -4 -5 -3 -1 -2 -5 -4 -4 449 S 0 -3 -2 -3 0 -2 -2 -3 -3 -4 -4 -2 -4 -5 2 6 2 -5 -4 -4 450 K 0 3 0 1 -5 0 0 -4 -1 -4 -3 4 -3 -2 2 1 -1 -5 -4 -4 451 N -4 -3 8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5 452 I -3 -5 -5 -6 0 -5 -5 -6 -5 6 2 -5 2 -2 -5 -4 -3 -5 -3 3 453 M -4 -4 -6 -6 -3 -4 -5 -6 -5 0 6 -5 1 0 -5 -4 -3 -4 -3 0 454 V -3 -3 -5 -6 -3 -4 -5 -6 -5 3 3 -4 2 -2 -5 -4 -3 -5 -3 5 455 K -2 1 1 4 -5 0 -1 -2 1 -4 -2 4 -3 -2 -3 0 -1 -5 -2 -3 456 N 1 1 3 0 -4 -1 1 0 -3 -4 -4 3 -2 -5 -2 2 -2 -5 -4 -4 457 D -3 -2 5 5 -1 -1 1 -1 0 -5 -4 0 -2 -5 -1 0 -2 -6 -4 -5 458 L -3 -1 0 -3 0 -3 -2 3 -4 -2 3 0 1 1 -2 -2 -3 5 -1 -3 catalytic loop cell-surface receptor daf-1 [Caenorhabditis elegans] ; gi|25145039|ref|NP_499868.2; -j3 vs nr >blastpgp -i NP_499868.2 -d nr -j 3 -Q NP_499868.pssm

Local Alignment Statistics Expect Value E = number of database hits you expect to find by chance, ≥ S (applies to ungapped alignments) E = Kmne-S or E = mn2-S’ K = scale for search space  = scale for scoring system S’ = bitscore = (S - lnK)/ln2 Lambda -- the expected increase in reliability of an alignment associated with a unit increase in alignment score. E (range 0 to infinity) can be extrapolated to an expectation over the entire database search, by converting the pairwise expectation to a probability (range 0-1) and multiplying the result by the ratio of the entire database size to the length of the database sequence: E_database = (1 - exp(-E)) D / d where D is the size of the database; d is the length of the matching database sequence; and the quantity (1 - exp(-E)) is the probability, P, corresponding to the expectation E for the pairwise comparison. As E approaches 0, E and P approach equality. Due to inaccuracy in the statistical methods as applied in the BLAST programs, whenever E and P are less than about 0.05, the two values can be practically treated as equal. Effective length: computed according to: Length_eff = MAX( Length_real - Lambda S / H , 1) where H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H : the information expected to be obtained from each pair of aligned residues in a real alignment that distinguishes the alignment from a random one. More info: The Statistics of Sequence Similarity Scores

An Alignment BLAST Cannot Make 1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT 121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC Reason: no contiguous exact match of 7 bp.

An Alignment BLAST Can Make Score = 290 bits (741), Expect = 7e-77 Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%) Frame = +3 Solution: compare protein sequences; BLASTX BLAST 2 Sequences (blastx) output:

Other BLAST Algorithms Megablast Discontiguous Megablast PSI-BLAST PHI-BLAST

Megablast: NCBI’s Genome Annotator Long alignments of similar DNA sequences Greedy algorithm Concatenation of query sequences Faster than blastn; less sensitive

MegaBLAST & Word Size Trade-off: sensitivity vs speed blastn megablast 2 3 blastp 8 28 megablast 7 11 blastn minimum default WORD SIZE Consider for each search

Discontiguous Megablast Uses discontiguous word matches Better for cross-species comparisons

Templates for Discontiguous Words W = 11, t = 16, coding: 1101101101101101 W = 11, t = 16, non-coding: 1110010110110111 W = 12, t = 16, coding: 1111101101101101 W = 12, t = 16, non-coding: 1110110110110111 W = 11, t = 18, coding: 101101100101101101 W = 11, t = 18, non-coding: 111010010110010111 W = 12, t = 18, coding: 101101101101101101 W = 12, t = 18, non-coding: 111010110010110111 W = 11, t = 21, coding: 100101100101100101101 W = 11, t = 21, non-coding: 111010010100010010111 W = 12, t = 21, coding: 100101101101100101101 W = 12, t = 21, non-coding: 111010010110010010111 W = word size; # matches in template t = template length Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5

Discontiguous (Cross-species) MegaBLAST

Discontiguous Word Options Non-affine gap scores: gap open penalty = 0; gap extension penalty = match/2 - mismatch (mismatch < 0). Non-affine gapping parameters tend to yield alignments with more gaps, but the gap lengths are shorter. The latter formula is applied automatically if -G and -E command line parameters are both set to 0.

Disco. Megablast Example . . . Query: NM_078651 Drosophila melanogaster CG18582-PA (mbt) mRNA, (3244 bp) /note= mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2 Database: nr (nt), Mammalia[orgn] MegaBLAST = “No significant similarity found.” Discontiguous megaBLAST = numerous hits . . .

Ex: Discontiguous MegaBLAST

Ex: BLASTN

PSI-BLAST Position-specific Iterated BLAST Example: Confirming relationships of purine nucleotide metabolism proteins

PSI-BLAST E value cutoff for PSSM >gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK 0.005 E value cutoff for PSSM

RESULTS: Initial BLASTP Same results as protein-protein BLAST; different format

Results of First PSSM Search Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Tenth PSSM Search: Convergence Just below threshold, another nucleotide metabolism enzyme Check to add to PSSM

PHI-BLAST [GA]xxxxGK[ST] >gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4 MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASE LIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHV IKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDI LKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEI ASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK [GA]xxxxGK[ST]

What’s New?

BLAST Databases nr = nr Nucleotide refseq_rna = NM_*, XM_* refseq_genomic = NC_*, NG_* env_nt environmental sample[filter], e.g., 16S rRNA Protein refseq = NP_*, XP_* env_nr nr = nr

New Formatter Select lower case Select red

BLAST Output: Alignments & Filter low complexity sequence filtered

BLAST Output: CDS Feature

Advanced Options Limit to Organism Example Entrez Queries all[filter] NOT ma Example Entrez Queries all[Filter] NOT mammalia[Organism] ray finned fishes[Organism] srcdb refseq[Properties] Nucleotide only: biomol mrna[Properties] biomol genomic[Properties] OtherAdvanced –e 10000 expect value -v 2000 descriptions -b 2000 alignments -e 10000 -v 2000

Genome BLAST

Genome BLAST via Map Viewer

Example: Human Genome BLAST TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAAC ATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGAC CTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGC Human EST

Human Genome BLAST: Results

Human Genome BLAST: MapViewer Entrez Gene

Example: Mapping Oligos Onto a Genome ? >forward CCATGGCGACCCTGGAAAAGC >reverse CAGCAGCGGCTGTGCCTGCGG ? ?

Map Oligos Onto Genome forward primer reverse primer -W 7 –e 1000 >CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG forward primer reverse primer -W 7 –e 1000

Genome BLAST Results

Primer Alignments reverse primer forward primer

MapViewer

MapViewer

Sequence View (sv) forward reverse

Service Addresses BLAST blast-help@ncbi.nlm.nih.gov General Help info@ncbi.nlm.nih.gov Wayne Matten matten@ncbi.nlm.nih.gov