BLAST Slides adapted & edited from a set by

Presentation transcript:

BLAST
Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida)
Kerfeld CA, Scott KM (2011) Using BLAST to Teach “E-value-tionary” Concepts. PLoS Biology 9(2):e1001014

Starts with a Query Sequence in FASTA Format

Amino acid sequence:
>ribosomal protein L7/L12 [Thiomicrospira crunogena XCL-2]
MAITKDDILEAVANMSVMEVVELVEAMEEKFGVSAAAVAVAGPAGDAGAA
GEEQTEFDVVLTGAGDNKVAAIKAVRGATGLGLKEAKSAVESAPFTLKEG
VSKEEAETLANELKEAGIEVEVK

Nucleotide sequence:
>gi|118139508:333094-333465 Thiomicrospira crunogena XCL-2
ATGGCAATTACAAAAGACGATATTTTAGAAGCAGTTGCTAACATGTCAGTAATGGAAGTTGTTGAACTTGTTGAAGCAATGGAAGAGAAGTTTGGTGTTTCTGCAGCAGCAGTTGCGGTTGCAGGTCCTGCAGGTGATGCTGGCGCTGCTGGTGAAGAACAAACAGAGTTTGACGTTGTCTTGACTGGTGCTGGTGACAACAAAGTTGCAGCAATCAAAGCCGTTCGTGGCGCAACTGGTCTTGGGCTTAAAGAAGCGAAAAGTGCAGTTGAAAGTGCACCATTTACGCTTAAAGAGGGTGTTTCTAAAGAAGAAGCAGAAACTCTTGCAAATGAGCTTAAAGAAGCAGGTATTGAAGTCGAAGTTAAATAA

Note the description line:
- Starts with “>”, ends with a carriage return
- Not read as sequence data

Kerfeld and Scott, PLoS Biology 2011
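The description-line rule above can be illustrated with a minimal FASTA reader. This is only a sketch: the function name `parse_fasta` is hypothetical (not part of BLAST or any library), and the example record is the amino acid sequence from the slide.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" line opens a new record; it is not sequence data.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Every other line is sequence data for the current record.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>ribosomal protein L7/L12 [Thiomicrospira crunogena XCL-2]
MAITKDDILEAVANMSVMEVVELVEAMEEKFGVSAAAVAVAGPAGDAGAA
GEEQTEFDVVLTGAGDNKVAAIKAVRGATGLGLKEAKSAVESAPFTLKEG
VSKEEAETLANELKEAGIEVEVK
"""

records = parse_fasta(example)
description, sequence = records[0]
print(description)
print(len(sequence), "residues")
```

The sequence lines are joined into one string, so line wrapping in the input does not affect the query.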

NCBI BLAST Interface
(blastp: for protein-protein alignments)
(Paste FASTA format sequence here)

NCBI BLAST Results Page: Potential homologs retrieved from database

Overview of BLAST
- Segment the query sequence into short “words”
- Use the query sequence segments to scan the database for matching sequences
- Extend the matched segments in either direction to find local alignments
- Create a list of hits & alignments, with best matches first

BLAST Phase 1: Segment the query sequence and identify words that could form potential alignments
- Segment the query sequence into pieces (“words”)
  - Default word length: 3 amino acids or 11 nucleotides
- Create a list of synonyms and their scores for comparing query words to target words
  - Uses scoring matrix to calculate scores for synonyms that might be found in the database
- Save the scores (and synonyms) exceeding a given threshold T
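Phase 1 can be sketched in a few lines. The names `query_words`, `neighborhood` and `toy_score` are hypothetical, and `toy_score` (+4 for identity, -1 otherwise) is a deliberately simple stand-in for a real matrix such as BLOSUM62, so the scores below are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Toy substitution score (an assumption, NOT BLOSUM62)."""
    return 4 if a == b else -1


def query_words(query, w=3):
    """Segment the query into all overlapping words of length w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold, score=toy_score):
    """Return {synonym: score} for every word scoring at or above T."""
    synonyms = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, candidate))
        if s >= threshold:
            synonyms["".join(candidate)] = s
    return synonyms


words = query_words("SWITEASFSPPGIM")
synonyms = neighborhood("SWI", threshold=7)
print(words[:3])
print(len(synonyms), "synonyms kept for SWI")
```

With a real matrix the synonym list favors conservative substitutions; with the toy score every single-residue change is treated alike.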

BLAST Phase 2: Using the query sequence word list, scan the database for synonyms (hits)
- Scan the database for matches to the word list with acceptable T values
- Require two matches (“hits”) within the target sequence
- Set aside sequences with matches above T for further analysis

Words: SWI, PGI
Possible match from the database: …………..SWITEASFSPPGIM…..
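A minimal sketch of the scan follows. It assumes the word list is stored as a mapping from word to query positions, and it interprets the two-hit requirement as two non-overlapping hits on the same diagonal within a window. The function names, the window of 40 and the flanking residues in the target are assumptions made for illustration.

```python
from collections import defaultdict


def scan_target(word_positions, target, w=3):
    """Return (query_pos, target_pos) for each target window found in the word list."""
    hits = []
    for t in range(len(target) - w + 1):
        for q in word_positions.get(target[t:t + w], ()):
            hits.append((q, t))
    return hits


def two_hit_diagonals(hits, w=3, window=40):
    """Return diagonals (target_pos - query_pos) carrying two non-overlapping hits."""
    by_diagonal = defaultdict(list)
    for q, t in hits:
        by_diagonal[t - q].append(t)
    kept = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        pairs = ((a, b) for i, a in enumerate(positions) for b in positions[i + 1:])
        if any(w <= b - a <= window for a, b in pairs):
            kept.append(diagonal)
    return sorted(kept)


# Words SWI and PGI at their positions in the query SWITEASFSPPGIM.
word_positions = {"SWI": [0], "PGI": [10]}
target = "AASWITEASFSPPGIMKK"  # hypothetical database sequence

hits = scan_target(word_positions, target)
print(hits)
print(two_hit_diagonals(hits))
```

Grouping by diagonal keeps only pairs of hits that could belong to the same ungapped alignment.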

BLAST Phase 3: Extending the hits
- Search 5’ and 3’ of the word hit on both the query and target sequence
- Add up the score for sequence identity or similarity until value exceeds S
- Alignment is dropped from subsequent analyses if value never exceeds S
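The extension step can be sketched as an ungapped walk in both directions from the seed. Two things here are assumptions that the slide does not state: the stopping rule (give up once the running score falls more than `x_drop` below the best score seen) and the toy scoring function, which stands in for a real matrix. All names are hypothetical.

```python
def toy_score(a, b):
    """Toy substitution score (an assumption, NOT BLOSUM62)."""
    return 4 if a == b else -1


def _extend(query, target, q, t, step, x_drop, score):
    """Walk in one direction; return (best score gain, residues kept)."""
    best = current = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= t < len(target):
        current += score(query[q], target[t])
        length += 1
        if current > best:
            best, best_len = current, length
        elif best - current > x_drop:
            break  # score has dropped too far below the best seen
        q += step
        t += step
    return best, best_len


def extend_hit(query, target, q, t, w=3, x_drop=5, score=toy_score):
    """Extend a word hit at (q, t); return (score, query_start, query_end)."""
    seed = sum(score(query[q + i], target[t + i]) for i in range(w))
    right_gain, right_len = _extend(query, target, q + w, t + w, 1, x_drop, score)
    left_gain, left_len = _extend(query, target, q - 1, t - 1, -1, x_drop, score)
    return seed + left_gain + right_gain, q - left_len, q + w + right_len


query = "SWITEASFSPPGIM"
target = "AASWITEASFSPPGIMKK"  # hypothetical database sequence
score, start, end = extend_hit(query, target, q=0, t=2)
cutoff_S = 20  # hypothetical cutoff
print(score, start, end, "kept" if score > cutoff_S else "dropped")
```

The returned interval is half-open in query coordinates, so `query[start:end]` is the aligned segment.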

So, to summarize, BLAST:
- Segments the query sequence into “words” and scores potential word matches
- Keeps the word matches that meet a threshold score T
  - Uses a scoring matrix to calculate this (e.g., BLOSUM62)
- Uses this list of “synonyms” to scan the database
- Extends the alignments to see if they meet a cutoff score S
  - Uses a scoring matrix to calculate this
- Reports the alignments that exceed S

PAM and BLOSUM Matrices
- Scoring matrices are calibrated to capture different degrees of sequence similarity
- In practice, this means choosing a matrix appropriate to the suspected degree of sequence identity between the query and its hits
- PAM: empirically derived from closely related sequences (higher-numbered PAM matrices are extrapolated to more distant relatives)
- BLOSUM: empirically derived from conserved blocks in more distantly related sequences (lower-numbered BLOSUM matrices suit more distant relatives)

Raw Scores (S values) from an Alignment
$S = \sum M_{ij} - cO - dG$
where
- $M_{ij}$ = score from a similarity matrix for a particular pair of amino acids (i, j)
- c = number of gaps
- O = penalty for the existence of a gap
- d = total length of gaps
- G = per-residue penalty for extending the gap
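The raw score formula can be worked through on a gapped alignment written with "-" for gap positions. This is a sketch under stated assumptions: `toy_score` replaces a real matrix, and the penalties O = 11 and G = 1 are illustrative values, not taken from the slides.

```python
def toy_score(a, b):
    """Toy substitution score (an assumption, NOT BLOSUM62)."""
    return 4 if a == b else -1


def raw_score(aligned_query, aligned_target, score=toy_score,
              gap_open=11, gap_extend=1):
    """Compute S = sum(M_ij) - c*O - d*G for two aligned strings."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned strings must have equal length")
    matrix_sum = 0   # sum of M_ij over aligned residue pairs
    n_gaps = 0       # c: number of distinct gaps
    gap_length = 0   # d: total number of gap positions
    previous_side = None
    for a, b in zip(aligned_query, aligned_target):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if a == "-" or b == "-":
            side = "query" if a == "-" else "target"
            if side != previous_side:
                n_gaps += 1  # a new gap opens here
            gap_length += 1
            previous_side = side
        else:
            matrix_sum += score(a, b)
            previous_side = None
    return matrix_sum - n_gaps * gap_open - gap_length * gap_extend


print(raw_score("SWIT-EAS", "SWITKEAS"))  # 7 matches, one gap of length 1
print(raw_score("SW--TE", "SWIKTE"))      # 4 matches, one gap of length 2
```

Because d counts every gap position, a gap of length k costs O + kG under this formula.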

Limitations of Raw Scores
- S values depend on the substitution matrix and gap penalties
- S values from hits retrieved in different BLAST searches cannot be compared when different matrices and gap penalties are used

Going from Raw Scores to Bit Scores
$S' = \frac{\lambda S - \ln K}{\ln 2}$
where
- S’ = bit score
- λ and K = normalizing parameters of the specific matrices and search spaces
- Larger raw scores result in larger bit scores
- Allows user to compare scores obtained by using different matrices and search spaces
- The unit is the bit (as in 0 vs 1)
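The conversion is a one-line calculation. In the sketch below the function name is hypothetical, and the λ and K values are commonly quoted figures for BLOSUM62 (ungapped, and gapped with penalties 11/1); treat them as illustrative assumptions, since real values are reported in the parameter summary of each search.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score S into a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed, illustrative parameter sets for two scoring systems.
ungapped = {"lam": 0.318, "k": 0.134}
gapped = {"lam": 0.267, "k": 0.041}

for name, params in (("ungapped", ungapped), ("gapped", gapped)):
    print(name, round(bit_score(100, **params), 1))
```

The same raw score maps to different bit scores under the two parameter sets, which is why bit scores rather than raw scores are compared across searches.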

Limitations of Bit Scores
- How high does a bit score have to be to suggest common ancestry?
- Hard to evaluate hits as homologs or not, based solely on bit scores

E-value
- Number of distinct alignments with scores greater than or equal to a given value expected to occur in a search against a database of known size, based solely on chance, not homology
- Large E-values suggest that the query sequence and retrieved sequence similarities are due to chance
- Small E-values suggest that the sequence similarities are due to shared ancestry (or potentially convergent evolution)

Calculating E-values
$E = \frac{n \times m}{2^{S'}}$
where
- m = effective length of the query sequence = length of query sequence – average length of alignments (controls for fewer alignments occurring at the ends of the query sequence)
- n = effective length of the database (total number of residues: bases or amino acids)
- The value of E decreases exponentially with increasing S (and therefore S’)
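A short numerical sketch of the formula, with hypothetical function names and made-up lengths, shows how quickly E shrinks as the bit score grows.

```python
def effective_query_length(query_length, mean_alignment_length):
    """m = query length minus average alignment length (kept at 1 or more)."""
    return max(query_length - mean_alignment_length, 1)


def e_value(m, n, bit_score):
    """E = (n * m) / 2**S' for effective lengths m, n and bit score S'."""
    return (n * m) / 2 ** bit_score


m = effective_query_length(123, 23)  # hypothetical lengths
n = 1_000_000                        # hypothetical effective database length

for bits in (20, 30, 40, 50):
    print(bits, e_value(m, n, bits))
```

Each additional bit halves E, so a 10-bit gain lowers E by a factor of 1024.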

BLAST Parameters
- Expect
- Word size
- Matrix
- Gap costs
- Filter
- Mask

E-value Threshold
- Alignments will be reported with E-values less than or equal to the expect value threshold
- Setting a larger E threshold will result in more reported hits
- Setting a smaller E threshold will result in fewer reported hits
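The effect of the threshold can be shown on a made-up hit list. The hit names, E-values and the default of 10 in this sketch are hypothetical and serve only to illustrate the "less than or equal to" rule.

```python
def report_hits(hits, expect=10.0):
    """Keep (name, e_value) hits with E <= expect, best (smallest E) first."""
    kept = [hit for hit in hits if hit[1] <= expect]
    return sorted(kept, key=lambda hit: hit[1])


# Hypothetical hits, not real BLAST output.
hits = [("hit_B", 0.5), ("hit_C", 25.0), ("hit_A", 1e-30)]

print(report_hits(hits, expect=10.0))
print(report_hits(hits, expect=1e-5))
print(report_hits(hits, expect=100.0))
```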

Filter and Mask
Filter: Low complexity
- Replaces the following with N (nucleotides) or X (amino acids):
  - Dinucleotide repeats
  - Amino acid repeats
  - Leader sequences
  - Stretches of hydrophobic residues
Mask: Lower case
- Replaces lowercase letters in the sequence with N or X
- Lowercase letters typically indicate a base or amino acid not known with certainty
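The lowercase mask is simple to sketch. The function below is a hypothetical illustration of the replacement rule only; it does not reproduce how BLAST handles masked regions internally, and it does not attempt low-complexity filtering.

```python
def mask_lowercase(sequence, protein=False):
    """Replace lowercase letters with X (protein) or N (nucleotide)."""
    mask_char = "X" if protein else "N"
    return "".join(mask_char if ch.islower() else ch for ch in sequence)


print(mask_lowercase("ATGcatGGa"))
print(mask_lowercase("MAITkddILE", protein=True))
```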

Parameter Summary is Found at the Bottom of the Output

Evaluating BLAST Results

Examine the BLAST Alignment
- Does it cover the whole length of both the query and subject sequences?
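One way to put a number on this question is the fraction of each sequence covered by the alignment. The sketch assumes 1-based, inclusive start and end coordinates, and the coordinates and lengths used are made up.

```python
def coverage(start, end, length):
    """Fraction of a sequence spanned by an alignment (1-based, inclusive)."""
    return (end - start + 1) / length


# Hypothetical alignment: query residues 1-123 of 123, subject 11-60 of 100.
query_cov = coverage(1, 123, 123)
subject_cov = coverage(11, 60, 100)
print(f"query {query_cov:.0%}, subject {subject_cov:.0%}")
```

High coverage of the query but low coverage of the subject can indicate that only one domain of a longer protein was matched.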

High E-value: Discovery of a Distant Homolog or Garbage?
- Take another look at the target (subject) sequence(s) that have high E-values
  - Similar length?
  - Recurring motifs?
  - Similar biological functions?
- Use target sequences as query sequences for another BLAST search
  - Does the original query sequence come up in report?
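The reverse-search check can be sketched without committing to any particular way of running BLAST. Here `search` is a hypothetical callable that takes a sequence identifier and returns hit identifiers ordered best first; the stub results are invented for illustration.

```python
def reciprocal_check(query_id, search, top_n=5):
    """Return hits whose own top_n search results include the original query."""
    confirmed = []
    for hit_id in search(query_id)[:top_n]:
        if query_id in search(hit_id)[:top_n]:
            confirmed.append(hit_id)
    return confirmed


# Stub standing in for real searches (hypothetical identifiers).
fake_results = {
    "Q": ["A", "B"],
    "A": ["A", "Q"],  # searching with A recovers Q
    "B": ["B", "C"],  # searching with B does not
}


def fake_search(seq_id):
    return fake_results.get(seq_id, [])


print(reciprocal_check("Q", fake_search))
```

A hit that recovers the original query in the reverse search is a stronger candidate for a distant homolog than one that does not.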