NCBI FieldGuide MapViewer Genome Resources and Sequence Similarity LocusLink UniGene Homologene Basic Local Alignment Search Tool Gene database.

Presentation transcript:

NCBI FieldGuide MapViewer Genome Resources and Sequence Similarity LocusLink UniGene Homologene Basic Local Alignment Search Tool Gene database

NCBI FieldGuide Basic Local Alignment Search Tool

NCBI FieldGuide Topics Why use sequence similarity? BLAST algorithm –blastn, blastp, megablast BLAST statistics BLAST output Examples

NCBI FieldGuide Why Do We Need Sequence Similarity Searching? To identify and annotate sequences. To evaluate evolutionary relationships. Other uses: modeling genomic structure (e.g., Spidey); checking primer specificity in silico with NCBI’s tool.

NCBI FieldGuide Global vs Local Alignment Seq 1 Seq 2 Seq 1 Seq 2 Global alignment Local alignment

NCBI FieldGuide Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16aa)
Seq2: HEWASHEREBUTNOWISHERE (21aa)
Global:
Seq1: 1 W--HEREISWALTERNOW 16
        W  HERE
Seq2: 1 HEWASHEREBUTNOWISHERE 21
Local (two equally good local alignments):
Seq1: 1 W--HERE 5
        W  HERE
Seq2: 3 WASHERE 9
Seq1: 1 W--HERE 5
        W  HERE
Seq2: 15 WISHERE 21

NCBI FieldGuide Global programming algorithm

NCBI FieldGuide Global Dynamic Programming: Full sequence must be aligned. Gaps at ends are penalized as much as internal ones. F(n,m) is the best score for the alignment. Traceback can give >1 correct alignment. Used to examine closely related sequences. dynprog/dynamic.html
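The global recurrence described on this slide can be sketched as follows. This is a minimal illustration of Needleman-Wunsch-style scoring, not code from the presentation; the function name and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return F(n, m), the best score for aligning all of a with all of b."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # End gaps are penalized like internal gaps, so the borders accumulate gap costs.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[n][m]


print(global_alignment_score("ACGT", "AGT"))
```

Because every residue of both sequences must be placed, deleting one base from an otherwise identical sequence still costs a full gap penalty.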

NCBI FieldGuide Local Alignment – Smith-Waterman

NCBI FieldGuide Local alignments: how. The top row and left column are now filled with 0 (if the best alignment so far has a negative score, it is better to start a new one). The alignment can end anywhere in the matrix. Instead of starting the traceback at F(n, m), start at the highest value of F(i, j); the traceback ends when you hit a 0.
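The three changes listed on this slide (zero floor, end anywhere, traceback from the maximum until a 0) can be sketched as a small Smith-Waterman routine. The function name and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions for illustration; the test sequences are the ones from the Global vs Local Alignment slide.

```python
def local_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # top row and left column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback starts at the highest cell and stops at the first 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("WHEREISWALTERNOW", "HEWASHEREBUTNOWISHERE"))
```

With these particular penalties the gapped W--HERE alignment scores lower than the plain HERE match, so the sketch reports HERE; gentler gap penalties would change that.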

NCBI FieldGuide Heuristic alignment algorithms: Shortcuts are important. Searching a query of length 1000 against a database with 10^8 residues requires approximately 10^11 matrix cells; at ten million matrix cells a second, that would take about 3 hours. BLAST's heuristic is based on the observation that true-match alignments are very likely to contain, somewhere within them, a short stretch of identities. Look for short stretches to serve as seeds to extend.
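The cost estimate on this slide can be checked with a few lines of arithmetic, assuming a full dynamic programming matrix with one cell per (query residue, database residue) pair. The variable names are illustrative only.

```python
query_length = 1000            # residues in the query
database_residues = 10 ** 8    # residues in the database
cells_per_second = 10 ** 7     # ten million matrix cells a second

cells = query_length * database_residues   # size of the full matrix
seconds = cells / cells_per_second
hours = seconds / 3600

print(f"{cells:.0e} cells, {seconds:.0f} s, {hours:.1f} h")
```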

NCBI FieldGuide Seeding: BLAST breaks your query into words of fixed length (3 for protein, 11 for nucleotide). It then scans the database for words that score at least some minimum T against a word from the query set. When it finds one, it begins a “hit” extension, extending the possible match in both directions and stopping at the maximum-scoring extension.

NCBI FieldGuide Extension: The seeds are extended to locally optimal pairs, whose scores cannot be improved by extension or trimming. These locally optimal alignments are called high-scoring segment pairs, or HSPs. Sometimes only a portion of a sequence is returned, which is why you need to look carefully at your BLAST alignments.

NCBI FieldGuide Alignment example: “The quick brown fox jumps over the lazy dog.” vs. “The quiet brown cat purrs when she sees him.” Matches = +1; mismatches = -1; ignore spaces and do not allow gaps. Assume the seed is the capital T and extend the alignment. You’ll hit a mismatch (c/e): should you continue, and how far? Use a variable X to measure how far the score is allowed to drop below its best value. Set X = 5 and try the alignment; set X = 2 and try again. A large X value lets the extension continue through longer runs of mismatches, which costs time but can recover longer alignments; in practice, speed is more often modulated by word size and other parameters.
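The exercise on this slide can be worked through with a short ungapped extension routine. This is a sketch under stated assumptions: the function names are hypothetical, extension runs only to the right of the seed at the first letter, and extension stops as soon as the score has dropped by X or more below the best score seen (the slide does not say whether the cutoff is strict).

```python
def letters(text):
    """Uppercase the text and keep letters only (spaces and punctuation are ignored)."""
    return "".join(ch for ch in text.upper() if ch.isalpha())


def extend_right(a, b, x_drop, match=1, mismatch=-1):
    """Ungapped extension from position 0; returns (best score, best-scoring prefix of a)."""
    a, b = letters(a), letters(b)
    score = best = best_len = 0
    for i in range(min(len(a), len(b))):
        score += match if a[i] == b[i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score >= x_drop:   # assumed stopping rule: drop of X or more
            break
    return best, a[:best_len]


s1 = "The quick brown fox jumps over the lazy dog."
s2 = "The quiet brown cat purrs when she sees him."
print(extend_right(s1, s2, x_drop=5))
print(extend_right(s1, s2, x_drop=2))
```

Under these assumptions X = 2 gives up at the c/e, k/t mismatches and keeps only the first six letters, while X = 5 pushes through them and picks up "brown" as well.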

NCBI FieldGuide Gapped BLAST, a time saver: Extension is costly, so gapped BLAST uses a two-hit method: it requires two non-overlapping hits, each scoring at least T, within a distance A of one another before starting an ungapped extension of the second hit. A gapped extension takes much longer to execute than an ungapped one, but overall far fewer extensions are run, which saves time. T is adjustable: the higher T is, the smaller the search space.

NCBI FieldGuide Evaluation Once seeds are extended to generate alignments, these alignments are tested for statistical significance. Can establish thresholds for reporting

NCBI FieldGuide The Flavors of BLAST. Standard BLAST: traditional “contiguous” word hit; position independent scoring; nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx). Megablast: optimized for large batch searches; can use discontiguous words. PSI-BLAST: constructs PSSMs automatically and uses them as the query; very sensitive protein search. RPS BLAST: searches a database of PSSMs; tool for conserved domain searches.

NCBI FieldGuide BLASTN variations: BLASTN seeds are always identical words; T is never used. To make BLASTN faster, increase the word size; to make it more sensitive, decrease the word size. MegaBLAST increases word size to 28. The minimum word size is 7. http://monod.uwaterloo.ca/papers/02ph.pdf

NCBI FieldGuide BLASTP implementation: To make searches faster, set word size to 3 and T to a large value (999), which removes all potential neighborhood words (two-hit distance is 40 amino acids by default). Affine gaps: decreased penalty for gap extension relative to gap introduction.

NCBI FieldGuide Also, FASTA: Similar to gapped BLAST, except with a bigger neighborhood. Generates a lookup table to locate all identically matching words of length ktup (protein 1-2, DNA 4-6). Once these are identified, it looks for diagonals with many mutually supporting word matches. Extensions are similar to BLAST.

NCBI FieldGuide Scoring Matrices: A scoring matrix specifies a score, $s_{ij}$, for aligning residue i (in sequence I) with residue j (in sequence II). Choice of matrix depends on the divergence level of desired/expected hits. Examples: PAM, BLOSUM. Both can be modified for different divergence levels (e.g., BLOSUM40, BLOSUM62). Advice: try several matrices when possible.

NCBI FieldGuide Dayhoff Family of Matrices Dayhoff model measures sequence evolution in units of “PAMs” –One PAM unit represents the evolutionary distance in which 1% of the amino acids have changed. Mutability of an aa is its relative rate of change (amino acids with high mutabilities are more likely to change) –Mutability of alanine was defined to be 100.

NCBI FieldGuide Dayhoff Family of Matrices Problems with the original Dayhoff scheme It does not consider the genetic code. –Not all amino acid substitutions can occur by a single nucleotide substitution event. Parameters were estimated from a small sample of closely related proteins. Evolution at the “average site” of the “average protein” is being modeled.

NCBI FieldGuide BLOSUM Scoring Matrices: Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff)

NCBI FieldGuide Basic Local Alignment Search Tool: Widely used similarity search tool. Heuristic approach based on the Smith-Waterman algorithm. Finds best local alignments. Provides statistical significance. All combinations (DNA/protein) of query and database: DNA vs DNA; DNA translation vs protein; protein vs protein; protein vs DNA translation; DNA translation vs DNA translation. Available as www, standalone, and network clients.

NCBI FieldGuide How BLAST Works: Make lookup table of “words” for query. Scan database for hits. Ungapped extensions of hits (initial HSPs), with X dropoff (X1). Gapped extensions (no traceback), with X dropoff (X2). Gapped extensions (traceback; alignment details), with X dropoff (X3).

NCBI FieldGuide Nucleotide Words. Query: GTACTGGACATGGACCCTACAGGAA. Make a lookup table of words (11-mers): GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC, ACATGGACCCT, ... WORD SIZE: blastn minimum 7, default 11; megablast …

NCBI FieldGuide Protein Words. Query: GTQITVEDLFYNIATRRKALKN. Make a lookup table of words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ... Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc. Word size = 3 (default); word size can only be 2 or 3. [ -f 11 = blastp default ]
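The word lists on the two slides above can be reproduced with a sliding window over the query. This is a minimal sketch; the function name is hypothetical, and a real lookup table would also record each word's position in the query.

```python
def query_words(sequence, word_size):
    """Return every overlapping word of the given size, in query order."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


nucleotide_query = "GTACTGGACATGGACCCTACAGGAA"
protein_query = "GTQITVEDLFYNIATRRKALKN"

print(query_words(nucleotide_query, 11)[:3])
print(query_words(protein_query, 3)[:8])
```

A query of length L yields L - W + 1 words of size W, so larger word sizes give fewer, more specific seeds.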

NCBI FieldGuide Minimum Requirements for a Hit: Nucleotide BLAST requires one exact match (e.g., the word CATGCTTAATT in ATCGCCATGCTTAATTGGGCTT). Protein BLAST requires two neighboring matches within 40 aa (e.g., the neighborhood words SEI and YYN against GTQITVEDLFYNI). [ -A 40 = blastp default ]

NCBI FieldGuide BLASTP Summary
Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
Example query words: YLS, HFL
Neighborhood words, neighborhood score threshold T (-f) = 11:
YLS 15, YLT 12, YVS 12, YIT 10, etc.
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc.
High-scoring pair (HSP):
Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
+E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
Gapped extension with trace back, final HSP:
Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
+E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I + +
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337

NCBI FieldGuide Scoring Systems - Nucleotides
Identity matrix [ -r 1 -q -3 ]:
     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
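The raw score on this slide can be recomputed column by column. In this sketch the function name is hypothetical, and the single gap column is charged the same -3 as a mismatch; that is an assumption made to match the slide's arithmetic (19 matches, 3 penalized columns), not a statement about BLAST's actual gap costs.

```python
def raw_score(top, bottom, reward=1, penalty=-3):
    """Score an existing alignment: +reward per identical column, penalty otherwise."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == y and x != "-":
            score += reward
        else:
            score += penalty   # mismatch, or gap column (assumed to cost the same here)
    return score


print(raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA"))
```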

NCBI FieldGuide Scoring Systems - Proteins. Position Independent Matrices: PAM Matrices (Percent Accepted Mutation): derived from observation of a small dataset of alignments; implicit model of evolution; all calculated from PAM1; PAM250 widely used. BLOSUM Matrices (BLOck SUbstitution Matrices): derived from observation of a large dataset of highly conserved blocks; each matrix derived separately from blocks with a defined percent identity cutoff; BLOSUM62 is the default matrix for BLAST. Position Specific Score Matrices (PSSMs): PSI- and RPS-BLAST.

NCBI FieldGuide BLOSUM62 (matrix figure, rows and columns A R N D C Q E G H I L K M F P S T W Y V X): Common amino acids have low weights; rare amino acids have high weights. Scores are negative for less likely substitutions and positive for more likely substitutions.

NCBI FieldGuide Position-Specific Score Matrix DAF-1 Serine/Threonine protein kinases catalytic loop 174 PSSM scores 5 4

NCBI FieldGuide A R N D C Q E G H I L K M F P S T W Y V 435 K E S N K P A M A H R D I K S K N I M V K N D L Position-Specific Score Matrix catalytic loop [ >./blastpgp -i NP_ d nr -j 3 -Q NP_ pssm ]

NCBI FieldGuide Local Alignment Statistics: High scores of local alignments between two random sequences follow the Extreme Value Distribution (this applies to ungapped alignments). $E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$, where $m$ and $n$ are the query and database lengths, $K$ = scale for search space, $\lambda$ = scale for scoring system, and $S' = (\lambda S - \ln K)/\ln 2$ is the bit score. Expect Value $E$ = number of database hits you expect to find by chance with score ≥ $S$, that is, the expected number of random hits scoring at least as well as your score. More info:
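The relationship between raw score, bit score and Expect value can be sketched numerically, using the standard Karlin-Altschul form E = K·m·n·e^(-λS) with bit score S' = (λS - ln K)/ln 2. The values λ = 0.267 and K = 0.041 are assumed here (they are commonly quoted for gapped BLOSUM62 searches and are not given in the slides), and m and n are purely hypothetical search-space sizes. The raw score 103 is the one shown on the BLAST Output: Alignments slide.

```python
import math


def bit_score(raw, lam, K):
    """Normalize a raw score S to a bit score S'."""
    return (lam * raw - math.log(K)) / math.log(2)


def evalue_from_raw(raw, m, n, lam, K):
    return K * m * n * math.exp(-lam * raw)


def evalue_from_bits(bits, m, n):
    return m * n * 2 ** (-bits)


LAMBDA, K = 0.267, 0.041        # assumed statistical parameters
m, n = 100, 10 ** 9             # hypothetical query and database lengths

bits = bit_score(103, LAMBDA, K)
print(round(bits, 1))
print(evalue_from_raw(103, m, n, LAMBDA, K), evalue_from_bits(bits, m, n))
```

The two Expect value functions agree by construction, which is the point of the bit score: once S' is known, E depends only on the search space m·n and not on the scoring system.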

NCBI FieldGuide Gapped Alignments: Gapping provides more biologically realistic alignments. Gapped BLAST parameters are simulated for each scoring matrix. Affine gap costs = -(a+bk), where a = gap open penalty, b = gap extend penalty, and k = gap length. A gap of length 1 receives the score -(a+b).
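The affine gap formula on this slide is easy to tabulate. In this sketch the function name is hypothetical, and the defaults a = 11, b = 1 are an assumption (they are the values usually quoted for blastp with BLOSUM62, not values stated in the slides).

```python
def affine_gap_score(length, gap_open=11, gap_extend=1):
    """Score of one gap of the given length under affine costs: -(a + b*k)."""
    if length == 0:
        return 0
    return -(gap_open + gap_extend * length)


for k in (1, 2, 3, 10):
    print(k, affine_gap_score(k))
```

One gap of length 3 costs -14, while three separate gaps of length 1 cost -36, so affine costs favor a few long gaps over many short ones.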

NCBI FieldGuide An Alignment BLAST Cannot Make
1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Reason: no contiguous exact match of 7 bp.
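The reason given on this slide (no shared exact word of the minimum size) can be tested directly for any pair of sequences. This is a minimal sketch with a hypothetical function name; the short sequences below are toy fragments used only to show the behavior, not the full sequences from the slide.

```python
def shared_words(a, b, word_size):
    """Return the sorted set of exact words of the given size present in both sequences."""
    words_a = {a[i:i + word_size] for i in range(len(a) - word_size + 1)}
    words_b = {b[j:j + word_size] for j in range(len(b) - word_size + 1)}
    return sorted(words_a & words_b)


# Toy fragments that share a 6-mer but no 7-mer.
frag1 = "ACCACGC"
frag2 = "GACCACGG"
print(shared_words(frag1, frag2, 6))
print(shared_words(frag1, frag2, 7))
```

If the result is empty at the minimum word size, a nucleotide search has no seed to extend, however similar the two sequences look overall.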

NCBI FieldGuide BLAST 2 Sequences (blastx) output: An Alignment BLAST Can Make Solution: compare protein sequences; BLASTX Score = 290 bits (741), Expect = 7e-77 Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%) Frame = +3

NCBI FieldGuide Other BLAST Algorithms Megablast Discontiguous Megablast PSI-BLAST

NCBI FieldGuide Megablast: NCBI’s Genome Annotator. Long alignments of similar DNA sequences. Greedy algorithm. Concatenation of query sequences. Faster than blastn; less sensitive.

NCBI FieldGuide Discontiguous Megablast Uses discontiguous word matches Better for cross-species comparisons

NCBI FieldGuide Discontiguous (Cross-species) MegaBLAST

NCBI FieldGuide Discontiguous Word Options

NCBI FieldGuide Templates for Discontiguous Words: one template for each combination of W = 11 or 12, t = 16, 18, or 21, and coding or non-coding sequence. W = word size (# matches in template); t = template length. Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5.

NCBI FieldGuide BLAST Databases: Nucleic Acid. nr (nt): traditional GenBank divisions, NM_ and XM_ RefSeqs. dbest: EST division. htgs: HTG division. gss: GSS division. chromosome: NC_ RefSeqs. env_nr: environmental sample[filter], e.g., 16S rRNA.

NCBI FieldGuide BLAST Databases: Protein. nr (non-redundant protein sequences): GenBank CDS translations; NP_ RefSeqs; outside databases (PIR, Swiss-Prot, PRF); PDB (sequences from structures). env_nr (environmental sample[filter]).

NCBI FieldGuide Web BLAST: BLASTP. 1. Paste in the query sequence. 2. Select the appropriate db. 3. BLAST. Query:
>Mutated in Colon Cancer
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILE
VQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGS
DKVYAHQMVRTDSREQKLDAFLQPLSKPLSS

NCBI FieldGuide Format Options

NCBI FieldGuide BLAST Formatting Page BLASTQ3

NCBI FieldGuide RPS-BLAST (CD search) Results Summary partial sequence partial domain

NCBI FieldGuide RPS-BLAST Results (CDD) DNA_mis_repair complete sequence

NCBI FieldGuide BLAST Output: Graphic Overview Sort results by taxonomy same database sequence

NCBI FieldGuide BLAST Output: Descriptions sorted by e values 8 X Bacterial mismatch repair proteins Linkouts E value cutoff GEO UniGene Structure

NCBI FieldGuide BLAST Output: Alignments
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615
Score = 44.3 bits (103), Expect = 5e-05
Identities = 25/59 (42%), Positives = 33/59 (55%), Gaps = 8/59 (13%)
Query: 9 LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILERVQQHIESKL 59
L + P L LEI P VDVNVHP KHEV F +H+ + +L +QQ +E+ L
Sbjct: 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338
Figure labels: positive (conservative) substitution; negative substitution; gap.

NCBI FieldGuide BLAST Output: Alignments & Filter low complexity sequence filtered

NCBI FieldGuide Advanced Options. Limit to Organism. Example Entrez queries, proteins: all[Filter] NOT mammalia[Organism]; ray finned fishes[Organism]; srcdb refseq[Properties]. Nucleotide only: biomol mrna[Properties]; biomol genomic[Properties]. Other Advanced: -e 10000 (expect value); -v 2000 (descriptions); -b 2000 (alignments). Filter options.

NCBI FieldGuide PSI-BLAST Example: Confirming relationships of purine nucleotide metabolism proteins Position-specific Iterated BLAST

NCBI FieldGuide >gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK PSI-BLAST E value cutoff for PSSM

NCBI FieldGuide RESULTS: Initial BLASTP Same results as protein-protein BLAST; different format

NCBI FieldGuide Results of First PSSM Search Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

NCBI FieldGuide Tenth PSSM Search: Convergence Just below threshold, another nucleotide metabolism enzyme Check to add to PSSM

NCBI FieldGuide Reverse PSI-BLAST (RPS)-BLAST

NCBI FieldGuide Adenosine/AMP Deaminase Domain AMP Deaminases......

NCBI FieldGuide PHI-BLAST >gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4 MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASE LIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHV IKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDI LKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEI ASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK [GA]xxxxGK[ST]
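The pattern on this slide, [GA]xxxxGK[ST], can be read as a regular expression in which each x stands for any residue. The sketch below only illustrates how such a pattern locates a site in a sequence; it is not how PHI-BLAST itself is implemented, the function name is hypothetical, and the test fragment is a short stretch copied from the CED4 sequence above.

```python
import re


def pattern_to_regex(pattern):
    """Translate the slide's notation to a regex: lowercase x means any single residue."""
    return pattern.replace("x", ".")


regex = re.compile(pattern_to_regex("[GA]xxxxGK[ST]"))

fragment = "FLFLHGRAGSGKSVIAS"   # stretch of the CED4 query shown on the slide
match = regex.search(fragment)
print(match.start(), match.group())
```

PHI-BLAST then restricts the search to database sequences that contain the same pattern, anchoring alignments at the matched site.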

NCBI FieldGuide MegaBLAST vs Discontiguous MegaBLAST NM_ Homo sapiens cytochrome P450, family 3, subfamily A, polypeptide 4 (CYP3A4), transcript variant 1, mRNA (2768 letters) vs Drosophila

NCBI FieldGuide MegaBLAST vs Discontiguous MegaBLAST. MegaBLAST = “No significant similarity found.” Discontiguous megaBLAST =

NCBI FieldGuide Genome BLAST

NCBI FieldGuide What is an HMM? Hidden Markov Model Important to know: it's a generalization of the profile in terms of statistical weights, rather than scores. At each position, the profile HMM gives the probability of finding a particular amino acid, an insertion, or a deletion HMMs are very popular in molecular data analysis but are not specific to this field

NCBI FieldGuide A Characterization Example: How could we characterize this (hypothetical) family of nucleotide sequences? Keep the multiple alignment. Try a regular expression: [AT] [CG] [AC] [ACTG]* A [TG] [GC]. But what about T G C T - - A G G vs. A C A C - - A T C? Try a consensus sequence: A C A A T C (depends on distance measure). Example borrowed from Salzberg, 1998.

NCBI FieldGuide HMMs to the rescue! Transition probabilities Emission Probabilities

NCBI FieldGuide Insert (Loop) States

NCBI FieldGuide Scoring our simple HMM: #1 = “T G C T - - A G G” vs. #2 = “A C A C - - A T C”. Regular expression ([AT] [CG] [AC] [ACTG]* A [TG] [GC]): #1 = member, #2 = member. HMM (probability): #1 = score of …%, #2 = score of 4.7%. HMM (log odds): #1 = score of …, #2 = score of 6.7.
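The regular-expression half of this comparison is easy to reproduce. The sketch below (function name hypothetical) strips the gap characters and spaces from each sequence and tests it against the family pattern; it shows why the regular expression cannot rank the two sequences, since it returns only member or non-member.

```python
import re

FAMILY = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")


def is_member(aligned_sequence):
    """True if the sequence (gaps and spaces removed) matches the family pattern in full."""
    residues = aligned_sequence.replace("-", "").replace(" ", "")
    return FAMILY.fullmatch(residues) is not None


print(is_member("T G C T - - A G G"))
print(is_member("A C A C - - A T C"))
```

Both sequences are reported as members, even though the second is far closer to the consensus; the HMM scores on the slide capture that difference.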

NCBI FieldGuide Standard Profile HMM Architecture Three types of states: –Match –Insert –Delete One delete and one match per position in model One insert per transition in model Start and end “dummy” states Example borrowed from Cline, 1999

NCBI FieldGuide Aligning and Training HMMs Training from a Multiple Alignment Aligning a sequence to a model –Can be used to create an alignment –Can be used to score a sequence –Can be used to interpret a sequence Training from unaligned sequences (not included in current HMMer package)

NCBI FieldGuide Training from an existing alignment: This process is what we’ve been seeing up to this point. Start with a predetermined number of states in your HMM. For each position in the model, assign a column in the multiple alignment that is relatively conserved. Emission probabilities are set according to amino acid counts in columns. Transition probabilities are set according to how many sequences make use of a given delete or insert state.

NCBI FieldGuide Remember the simple example Chose six positions in model. Highlighted area was selected to be modeled by an insert due to variability.

NCBI FieldGuide Aligning sequences to a model Now that we have a profile model, let’s use it! Try every possible path through the model that would produce the target sequence –Keep the best one and its probability.
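Trying every path explicitly is exponential in the sequence length; the usual way to "keep the best one and its probability" is the Viterbi algorithm. The sketch below is a generic Viterbi over a small two-state HMM, not a full profile HMM with match, insert and delete states; the state names and all probabilities are hypothetical values chosen for illustration.

```python
import math


def viterbi(seq, states, start, trans, emit):
    """Return (most probable state path, probability of that path) for seq."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[t][s] = best log-probability of any path ending in state s at position t
    scores = [{s: log(start[s]) + log(emit[s][seq[0]]) for s in states}]
    back = []
    for ch in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: scores[-1][r] + log(trans[r][s]))
            row[s] = scores[-1][prev] + log(trans[prev][s]) + log(emit[s][ch])
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for ptr in reversed(back):          # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    path.reverse()
    return path, math.exp(scores[-1][last])


states = ["AT_rich", "GC_rich"]          # hypothetical states
start = {"AT_rich": 0.5, "GC_rich": 0.5}
trans = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
         "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
emit = {"AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

print(viterbi("AATCG", states, start, trans, emit))
```

Working in log space avoids numerical underflow on long sequences; the same dynamic programming idea extends to the match, insert and delete states of a profile HMM.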

NCBI FieldGuide Profile HMMs
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A
A G C T C - C G A
Figure: probability of A, T, G, C at each position.
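Emission probabilities for a column of this alignment can be estimated by simple counting, as described on the training slide. In this sketch the function name is hypothetical, gap characters are ignored when counting (an assumption), and no pseudocounts are added, so unseen bases get probability 0.

```python
from collections import Counter

ALIGNMENT = [
    "ATCTC-CGA",
    "AGCT--TGG",
    "TGTTCTCTA",
    "AACTC-CGA",
    "AGCTC-CGA",
]


def column_frequencies(alignment, column):
    """Relative frequency of each base in one column, ignoring gap characters."""
    residues = [row[column] for row in alignment if row[column] != "-"]
    counts = Counter(residues)
    return {base: counts[base] / len(residues) for base in "ACGT"}


for col in range(4):
    print(col + 1, column_frequencies(ALIGNMENT, col))
```

The first column comes out as A 0.8, T 0.2 and the fourth as T 1.0; adding pseudocounts would replace the zeros with small positive values.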

NCBI FieldGuide Emission probabilities by position (figure): (A .8, C 0, G 0, T .2); (A .2, C 0, G .6, T .2); (A 0, C .8, G 0, T .2); (A 0, C 0, G 0, T 1); (A 0, C .8, G 0, T .2); (A 0, C 0, G .8, T .2); (A .8, C 0, G .2, T 0); (A 0, C .8, G 0, T …). Example sequences: TTTG, TTTT. Score = 8.2 x …; consensus score = 0.1. Scores are generally calculated with base e logarithms.

NCBI FieldGuide The HMM must first be “trained” using a database of known signals. Consensus sequences for all signals are needed. Compositional rules (i.e., emission probabilities) and length distributions are necessary for content sensors. Transition probabilities between all connected states must be estimated. Pseudocounts prevent the “regular expression” problem of non-matching or zero probability of a given amino acid…

NCBI FieldGuide Gene Finding Software: GENSCAN, HMMGENE, GENMARK (HMMs); GRAIL (neural net).

NCBI FieldGuide HMM resources. UC Santa Cruz (David Haussler group): SAM-02 server (returns alignments, secondary structure predictions, HMM parameters, etc.); SAM HMM building program (requires free academic license). Washington U. St. Louis (Sean Eddy group): Pfam, a large database of precomputed HMM-based alignments of proteins; HMMer, a program for building HMMs. Gene finders and other HMMs (more later).
