BCB 444/544 F07 ISU Dobbs, 9/12/07 - Lecture 10 (#10_Sept12): BLAST Details Plus some Gene Jargon

Presentation transcript:

9/12/07BCB 444/544 F07 ISU Dobbs #10 - BLAST details + some Gene Jargon1 BCB 444/544 Lecture 10 BLAST Details Plus some Gene Jargon #10_Sept12

Required Reading (before lecture)
√ Mon Sept 10 - for Lecture 9: BLAST variations; BLAST vs FASTA, SW (Chp 4)
√ Wed Sept 12 - for Lecture 10 & Lab 4: Multiple Sequence Alignment (MSA) (Chp 5)
Fri Sept 14 - for Lecture 11: Position Specific Scoring Matrices & Profiles (Chp 6, but not HMMs)
Good additional resource re: sequence alignment? Wikipedia

Assignments & Announcements - #1
Revised Grading Policy has been sent. Please review!
√ Mon Sept 10 - Lab 3 Exercise due 5 PM
Thu Sept 13 - Graded Labs 2 & 3 will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB); Study Guide for Exam 1 will be posted by 5 PM

Review: Gene Jargon #1 (for HW2, 1c)
Exons = "protein-encoding" (or "kept") parts of eukaryotic genes, vs
Introns = "intervening sequences" = segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into pre-RNA but are later removed by RNA processing & do not appear in mature mRNA, so they are not translated into protein

Assignments & Announcements - #2
Mon Sept 17 - Answers to HW#2 will be posted by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1. Will cover: Lectures 2-12 (thru Mon Sept 17); Labs 1-4; HW2; all assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming

Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
√ Evolutionary Basis
√ Sequence Homology versus Sequence Similarity
√ Sequence Similarity versus Sequence Identity
√ Methods (Dot Plots, DP; Global vs Local Alignment)
√ Scoring Matrices (PAM vs BLOSUM)
√ Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Local Alignment: Algorithm
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix: in local alignment there are no negative scores; assign "0" to cells that would otherwise have negative scores
3) Optimal score? In highest scoring cell(s)
4) Optimal alignment(s)? Traceback from each cell containing the optimal score, until a cell with "0" is reached (not just from lower right corner)

Local Alignment DP: Initialization & Recursion
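The initialization, recursion, and traceback steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `smith_waterman`, the match/mismatch values, and the linear (not affine) gap penalty are assumptions chosen for brevity, not the scoring scheme used in the lecture.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    Scoring values are illustrative assumptions; a linear gap penalty is used.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Step 1: top row and leftmost column are initialized with 0.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Step 2: fill the matrix; any negative score is replaced by 0.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            # Step 3: remember the highest-scoring cell.
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Step 4: trace back from the best cell until a 0 cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Only one optimal alignment is traced here; reporting all of them would require tracing back from every cell that holds the best score.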

A Few Words about Parameter Selection in Sequence Alignment
Optimal alignment between a pair of sequences depends critically on the selection of substitution matrix & gap penalty function
In using BLAST or similar software, it is important to understand and, sometimes, to adjust these parameters (default is NOT always best!)
How do we pick parameters that give the most biologically meaningful alignments and alignment scores?

Calculating an Alignment Score using a Substitution Matrix & an Affine Gap Penalty
Alignment score is the sum of all match/mismatch scores (from substitution matrix) with an affine penalty subtracted for each gap
a b c - - d
a c c e f d
=> 24 - (10 + 2) = 12
where 24 = match score (sum of values from substitution matrix), (10 + 2) = gap opening + extension penalty, and 12 = alignment score
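A small Python sketch of this calculation follows. The per-column values in `TOY_SCORES` are hypothetical (chosen only so that they sum to 24, as on the slide), and the gap model is an assumption: the first gap position costs `gap_open` (10) and each further position in the same gap costs `gap_extend` (2).

```python
# Hypothetical substitution scores, chosen so the aligned columns sum to 24.
TOY_SCORES = {("a", "a"): 9, ("b", "c"): -3, ("c", "c"): 9, ("d", "d"): 9}

def alignment_score(row_a, row_b, scores, gap_open=10, gap_extend=2):
    """Score two equal-length alignment rows with an affine gap penalty."""
    total = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            # Assumed model: opening a gap costs gap_open, each extra position gap_extend.
            total -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += scores[(x, y)]
            in_gap = False
    return total

print(alignment_score("abc--d", "accefd", TOY_SCORES))  # 24 - (10 + 2) = 12
```

For simplicity, a gap in one row followed immediately by a gap in the other row is treated as a single gap; a fuller implementation would track the two rows separately.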

Chp 4 - Database Similarity Searching
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 4 Database Similarity Searching
Unique Requirements of Database Searching
Heuristic Database Searching
Basic Local Alignment Search Tool (BLAST)
FASTA
Comparison of FASTA and BLAST
Database Searching with Smith-Waterman Method

Database searching
Query sequence + sequence database -> sequence comparison algorithm -> target sequences ranked by score

Why search a database?
Given a newly discovered gene: Does it occur in other species? Is its function known in another species?
Given a newly sequenced genome, which regions align with genomes of other organisms? Identification of potential genes; identification of other functional parts of chromosomes
Find members of a multigene family

Recall: There are 3 Basic Types of Alignment Algorithms
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3
1) Dot Matrix
2) Dynamic Programming
Xiong: Chp 4
3) Word or k-tuple methods (BLAST & FASTA)
Wikipedia: Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming.

Exhaustive vs Heuristic Methods
Exhaustive - tests every possible solution; guaranteed to give best answer (identifies optimal solution); can be very time/space intensive! e.g., Dynamic Programming (as in Smith-Waterman algorithm)
Heuristic - does NOT test every possibility; no guarantee that answer is best (but often can identify optimal solution); sacrifices accuracy (potentially) for speed; uses "rules of thumb" or "shortcuts"; e.g., BLAST & FASTA

Why do we Need Fast Search Algorithms?
Your query is 200 amino acids long (N)
You are searching a non-redundant database, which currently contains >10^6 proteins (K)
If proteins in database have avg length 200 aa (M), then:
Must fill in 200 × 200 × 10^6 = 4 × 10^10 DP entries!! 4 × 10^10 operations just to fill in the DP matrix!
DP for pairwise alignment is O(NM); searching in a database is O(NMK)
=> Need faster algorithms for searching in large databases!

FASTA vs BLAST
FASTA: user defines value for k = word length; slower, but more sensitive than BLAST at lower values of k (preferred for searches involving a very short query sequence)
BLAST family: family of different algorithms optimized for particular types of queries, such as searching for distantly related sequence matches; BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy
Both FASTA and BLAST are based on heuristics
Tradeoff: Sensitivity vs Speed; DP is slower, but more sensitive

Lab 3: focus on BLAST (Basic Local Alignment Search Tool)
STEPS:
1. Create list of every possible "word" (e.g., 3-11 letters) from query sequence
2. Search database to identify sequences that contain matching words
3. Score match of word with sequence, using a substitution matrix
4. Extend match (seed) in both directions, while calculating alignment score at each step
5. Continue extension until score drops below a threshold (due to mismatches)
High Scoring Segment Pair (HSP) - contiguous aligned segment pair (no gaps)

What are the Results of a BLAST Search?
Original version of BLAST? List of HSPs called Maximum Scoring Pairs
More recent, improved version of BLAST? Allows gaps: Gapped Alignment
How? Allows score to drop below threshold (but only temporarily)

Why is Gapped Alignment Harder?
Without gaps, there are N+M-1 possible alignments between sequences of length N and M
Once we start allowing gaps, there are many more possible arrangements to consider:
abcbcd   abcbcd   abcbcd
|||  |   |  |||   ||  ||
abc--d   a--bcd   ab--cd
Becomes a very large number when we also allow mismatches, because we need to look at every possible pairing between elements:
Roughly N^M possible alignments! e.g., for N = M = 100, there are 100^100 = 10^200 possible alignments & 100 aa is a small protein!

BLAST - a few details
Developed by Stephen Altschul at NCBI in 1990
Word length? Typically: 3 aa for protein sequence, 11 nt for DNA sequence
Substitution matrix? Default is BLOSUM62; can change under Algorithm Parameters; can choose other BLOSUM or PAM matrices; change other parameters here, too
Stop-Extension Threshold? Typically: 22 for proteins, 20 for DNA

BLAST - Statistical Significance?
1. E-value: E = m × n × P
m = total number of residues in database
n = number of residues in query sequence
P = probability that an HSP is result of random chance
Lower E-value: less likely to result from random chance, thus higher significance
2. Bit Score: S' = normalized score, to account for differences in size of database (m) & sequence length (n) - more later
3. Low Complexity Masking: remove repeats that confound scoring - more sooner

BLAST algorithms can generate both "global" and "local" alignments
(Figure: global alignment vs local alignment)

BLAST - a Family of Programs: Different BLAST "flavors"
BLASTP - protein sequence query against protein DB
BLASTN - DNA/RNA seq query against DNA DB (GenBank)
BLASTX - 6-frame translated DNA seq query against protein DB
TBLASTN - protein query against 6-frame DNA translation
TBLASTX - 6-frame DNA query to 6-frame DNA translation
PSI-BLAST - protein "profile" query against protein DB
PHI-BLAST - protein pattern against protein DB
Newest: MEGA-BLAST - optimized for highly similar sequences
Which tool should you use?

Review: Gene Jargon #2.1
6-Frame translated DNA Sequence? Remember GeneBoy exercise?

Review: Gene Jargon #2.2
6-Frame translated DNA Sequence? Try NCBI tools. Or, for some Biology review re: DNA/RNA & ORFs, see next 3 slides borrowed from EMBL-EBI.

Review: Gene Jargon #2.3
DNA Strands

Review: Gene Jargon #2.4
RNA Strands - copied from DNA

Review: Gene Jargon #2.5
Reading Frames
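To make the idea of a 6-frame translation concrete, here is a minimal Python sketch. It assumes the standard genetic code and an input containing only A, C, G, T; the function names and the frame labels (`+1` to `+3` for the given strand, `-1` to `-3` for the reverse complement) are illustrative choices, not taken from the NCBI tools.

```python
# Standard genetic code, built from the conventional TCAG ordering ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from the start of dna; trailing bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return a dict mapping frame label to protein string for all six frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for frame, protein in six_frame_translations("ATGGCC").items():
    print(frame, protein)
```

Programs such as BLASTX and TBLASTN compare against all six of these translations because the correct strand and reading frame of a DNA sequence are usually not known in advance.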

BLAST - How does it work?
Main idea - based on dot plots!

Dot Plots - apply in BLAST:
Perform fast, approximate local alignments to find sequences in database that are related to query sequence
Here, use 4-base "window" with 75% identity (allow mismatches)
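The windowed dot plot described above can be sketched as follows. This is a hedged illustration: the function name `dot_plot` and the return format (a list of window start positions) are assumptions; the 4-base window and 75% identity cutoff follow the slide.

```python
def dot_plot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return (i, j) start positions where the two windows are similar enough.

    A dot is placed when at least min_identity of the aligned positions in the
    window starting at seq_a[i] and seq_b[j] are identical.
    """
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches / window >= min_identity:
                dots.append((i, j))
    return dots

print(dot_plot("ACGTACGT", "ACGAACGT"))
```

Runs of dots along a diagonal (constant j - i) indicate regions that align without gaps, which is the pattern BLAST looks for.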

Detailed Steps in BLAST algorithm
1. Remove low-complexity regions (LCRs)
2. Make a list (dictionary): all words of length 3 aa or 11 nt
3. Augment list to include similar words
4. Store list in a search tree (data structure)
5. Scan database for occurrences of words in search tree
6. Connect nearby occurrences
7. Extend matches (words) in both directions
8. Prune list of matches using a score threshold
9. Evaluate significance of each remaining match
10. Perform Smith-Waterman to get alignment

1: Filter low-complexity regions (LCRs)
Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology. Low complexity sequences can yield false positives. Screen them out of your query sequences! When appropriate!
K = computational complexity; varies from 0 (very low complexity) to 1 (high complexity):
K = (1/L) log_N ( L! / Π_i n_i! )
where L = window length (usually 12), N = alphabet size (4 or 20), n_i = frequency of ith letter in the window
e.g., for GGGG: L! = 4! = 4×3×2×1 = 24; n_G = 4, n_T = n_A = n_C = 0; Π n_i! = 4!×0!×0!×0! = 24; K = 1/4 log_4 (24/24) = 0
For CGTA: K = 1/4 log_4 (24/1) = 0.57
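The two worked values above can be reproduced with a short Python sketch of the complexity formula. The function name `complexity` is an illustrative choice; the formula is the one given on the slide.

```python
from collections import Counter
from math import factorial, log

def complexity(window, alphabet_size=4):
    """Compute K = (1/L) * log_N( L! / prod(n_i!) ) for one sequence window."""
    L = len(window)
    denominator = 1
    for count in Counter(window).values():
        denominator *= factorial(count)   # letters that do not occur contribute 0! = 1
    return log(factorial(L) // denominator, alphabet_size) / L

print(round(complexity("GGGG"), 2))  # 0.0
print(round(complexity("CGTA"), 2))  # 0.57
```

For protein windows, pass `alphabet_size=20`.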

2: List all words in query
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
YGG GGF GFM FMT MTS TSE SEK …
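Listing the query words is a simple sliding window. A minimal sketch (the function name `query_words` and the 0-based positions are assumptions of this example; the slides count positions from 1):

```python
def query_words(query, w=3):
    """Return (position, word) for every overlapping w-letter word in the query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]

words = query_words("YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ")
print([word for _, word in words[:7]])
```

A query of length n yields n - w + 1 words.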

3: Augment word list
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
YGG GGF GFM FMT MTS TSE SEK …
AAA AAB AAC … YYY
20^3 = 8000 possible matches

3: Augment word list
BLOSUM62 scores:
GGF vs AAA = -2 (non-match)
GGF vs GGY = 15 (match)
A user-specified threshold, T, determines which 3-letter words are considered matches and non-matches

3: Augment word list
YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
YGG GGF GFM FMT MTS TSE SEK …
GGI GGL GGM GGF GGW GGY …

3: Augment word list
Observation: Selecting only words with score > T greatly reduces number of possible matches; otherwise, 20^3 for 3-letter words from amino acid sequences!

Example
(BLOSUM62 matrix, rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V)
Find all words that match EAM with a score greater than or equal to 11:
EAM = 14, DAM = 11, QAM = 11, ESM = 11, EAL = 11
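The EAM example can be checked with a brute-force sketch that scores all 20^3 candidate words. To keep it short, only the three BLOSUM62 rows needed for the query word EAM (rows A, E and M) are included; the names `BLOSUM62_ROWS`, `pair_score` and `neighborhood` are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for A, E and M only (columns in AMINO_ACIDS order).
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "M": [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1],
}

def pair_score(query_letter, other_letter):
    """Look up the substitution score for a query letter (A, E or M here)."""
    return BLOSUM62_ROWS[query_letter][AMINO_ACIDS.index(other_letter)]

def neighborhood(word, threshold):
    """Return {candidate: score} for all candidates scoring >= threshold against word."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(pair_score(q, c) for q, c in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits

print(neighborhood("EAM", 11))
```

Only a handful of the 8000 candidates survive the threshold. Lowering it to 10 admits KAM, ECM, EGM, ETM, EVM, EAI and EAV as well, which matches the 12-word list used in the search tree example.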

4: Store words in search tree
Augmented list of query words -> search tree
"Does this query contain GGF?" "Yes, at position 2."

Search tree
Root -> G -> G -> {L, M, F, W, Y}, giving the words GGF GGL GGM GGW GGY

Example
Put this word list into a search tree: DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV
First letter: D, Q, E, K
Second letter: A (under D), A (under Q), A/C/G/S/T/V (under E), A (under K)
Third letter: M on every branch, except E-A, which branches to L, M, I, V
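A search tree of this kind can be sketched as a nested-dictionary trie. This is an assumption about one reasonable data structure, not a description of BLAST's actual implementation; the `"$"` end marker and the function names are hypothetical.

```python
def build_trie(words):
    """Build a trie from {word: list of query positions}."""
    root = {}
    for word, positions in words.items():
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = positions   # "$" marks the end of a complete word
    return root

def lookup(trie, word):
    """Return the stored query positions for word, or None if it is absent."""
    node = trie
    for letter in word:
        if letter not in node:
            return None
        node = node[letter]
    return node.get("$")

# The GGF neighborhood from the slide; all map back to query position 2.
trie = build_trie({w: [2] for w in ["GGF", "GGL", "GGM", "GGW", "GGY"]})
print(lookup(trie, "GGF"))  # [2]
print(lookup(trie, "GGA"))  # None
```

Because words sharing a prefix share a path, each lookup costs at most w steps regardless of how many words are stored.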

5: Scan the database sequences
(Figure: dot plot of database sequence vs query sequence)

Example
Scan this "database" for occurrences of your words:
MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA
E A M P Q L S V D A M
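The scan can be sketched with a sliding window over the database sequence. For brevity this example looks words up in a Python set rather than a search tree; the word list is the 12-word EAM list from the search tree example, and the function name is illustrative.

```python
WORDS = {"DAM", "QAM", "EAM", "KAM", "ECM", "EGM",
         "ESM", "ETM", "EVM", "EAI", "EAL", "EAV"}

DATABASE = ("MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAV"
            "EAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA")

def scan_database(db_seq, words, w=3):
    """Return (position, word) for every w-letter window of db_seq found in words."""
    hits = []
    for pos in range(len(db_seq) - w + 1):
        window = db_seq[pos:pos + w]
        if window in words:
            hits.append((pos, window))
    return hits

for pos, word in scan_database(DATABASE, WORDS):
    print(pos, word)
```

Positions are 0-based. Note that none of the hits is the exact query word EAM; all three are neighborhood words.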

6: Connect nearby occurrences (diagonal matches in Gapped BLAST)
(Figure: dot plot of database sequence vs query sequence)
Two dots are connected if and only if they are less than A letters apart & are on the same diagonal
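A hedged sketch of this connection rule: hits are grouped by diagonal (database position minus query position), and neighboring hits on the same diagonal are paired when they are fewer than A letters apart. The function name, the hit format `(query_pos, db_pos)`, and the choice to pair only adjacent hits are assumptions of this example.

```python
from collections import defaultdict

def connect_hits(hits, max_distance):
    """Pair up adjacent hits on the same diagonal that are < max_distance apart.

    hits is a list of (query_pos, db_pos); max_distance plays the role of A.
    """
    by_diagonal = defaultdict(list)
    for q_pos, db_pos in hits:
        by_diagonal[db_pos - q_pos].append(q_pos)
    pairs = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        for first, second in zip(q_positions, q_positions[1:]):
            if second - first < max_distance:
                pairs.append(((first, first + diagonal),
                              (second, second + diagonal)))
    return pairs

print(connect_hits([(0, 10), (5, 15), (2, 30), (40, 50)], max_distance=10))
```

Here only the first two hits are connected: the third lies on a different diagonal and the fourth is too far away.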

7: Extend matches in both directions
(Figure: DB scan)

7: Extend matches, calculating score at each step
Each match is extended to left & right until a negative BLOSUM62 score is encountered
Extension step typically accounts for > 90% of execution time
Query sequence:    L P P Q G L L
Database sequence: M P P E G L L
BLOSUM62 scores:   2 7 7 2 6 4 4
The seed word is scored first (word score), then extended to give HSP score = 32 (High Scoring Pair)
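The extension rule on this slide (extend left and right until a negative score is encountered) can be sketched as below. Only the BLOSUM62 pairs needed for the example are included in `PAIR_SCORES`; the default of -4 for any other pair is a placeholder assumption, and the seed position is supplied by hand.

```python
# BLOSUM62 values for the pairs in the example; other pairs fall back to a default.
PAIR_SCORES = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2, ("G", "G"): 6, ("L", "L"): 4}

def score(a, b, default=-4):
    """Symmetric lookup; default is a placeholder for pairs not listed above."""
    return PAIR_SCORES.get((a, b), PAIR_SCORES.get((b, a), default))

def extend_seed(query, db, q_start, db_start, w=3):
    """Extend an ungapped seed of length w in both directions.

    Stops in each direction at the first negative pair score or sequence end.
    Returns (hsp_score, query_start, query_end) with query_end exclusive.
    """
    total = sum(score(query[q_start + k], db[db_start + k]) for k in range(w))
    lq, ld = q_start - 1, db_start - 1
    while lq >= 0 and ld >= 0 and score(query[lq], db[ld]) >= 0:
        total += score(query[lq], db[ld])
        lq, ld = lq - 1, ld - 1
    rq, rd = q_start + w, db_start + w
    while rq < len(query) and rd < len(db) and score(query[rq], db[rd]) >= 0:
        total += score(query[rq], db[rd])
        rq, rd = rq + 1, rd + 1
    return total, lq + 1, rq

print(extend_seed("LPPQGLL", "MPPEGLL", q_start=2, db_start=2))  # (32, 0, 7)
```

Stopping at the first negative score is the simplest reading of the slide; the earlier step list describes the alternative of continuing until the running score drops below a threshold, which tolerates isolated mismatches.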

8: Prune matches
Discard all matches that score below defined threshold

9: Evaluate significance
BLAST uses an analytical statistical significance calculation
RECALL:
1. E-value: E = m × n × P
m = total number of residues in database
n = number of residues in query sequence
P = probability that an HSP is result of random chance
Lower E-value: less likely to result from random chance, thus higher significance
2. Bit Score: S' = normalized score, to account for differences in size of database (m) & sequence length (n)
S' = (λ × S - ln K) / ln 2
where: λ = Gumbel distribution constant, S = raw alignment score, K = constant associated with scoring matrix
Note that bit score is linearly related to raw alignment score, so: higher S' means alignment has higher significance
For more details - see text & BLAST tutorial
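The bit score formula can be evaluated directly. The sketch below also computes an E-value using the standard Karlin-Altschul relation E = m × n × 2^(-S'), which is algebraically the same as K × m × n × e^(-λS); treating 2^(-S') as the probability term P is this example's interpretation, and the λ and K values in the usage line are illustrative assumptions rather than values from the lecture.

```python
from math import exp, log

def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - log(K)) / log(2)

def e_value(raw_score, m, n, lam, K):
    """E = m * n * 2^(-S'), equivalent to K * m * n * exp(-lambda * S)."""
    return m * n * 2 ** (-bit_score(raw_score, lam, K))

# Illustrative parameter values (assumed, not from the slides).
lam, K = 0.267, 0.041
print(bit_score(100, lam, K))
print(e_value(100, m=1_000_000, n=200, lam=lam, K=K))
```

Because S' is linear in S, ranking hits by bit score and by raw score gives the same order for a fixed scoring system; the bit score simply makes scores comparable across scoring systems.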

10: Use Smith-Waterman algorithm (DP) to generate alignment
ONLY significant matches are re-analyzed using Smith-Waterman DP algorithm. Alignments reported by BLAST are produced by dynamic programming

BLAST: What is a "Hit"?
A hit is a w-length word in database that aligns with a word from query sequence with score > T
BLAST looks for hits instead of exact matches; this allows word size to be kept larger for speed, without sacrificing sensitivity
Typically, w = 3-5 for amino acids, w = ... for DNA
T is the most critical parameter:
↑ T -> ↓ "background" hits (faster)
↓ T -> ↑ ability to detect more distant relationships (at cost of increased noise)

Tips for BLAST Similarity Searches
If you don't know, use default parameters first
Try several programs & several parameter settings
If possible, search on protein sequence level
Scoring matrices:
PAM1 / BLOSUM80: if expect/want less divergent proteins
PAM120 / BLOSUM62: "average" proteins
PAM250 / BLOSUM45: if need to find more divergent proteins
Proteins: >25-30% identity (and >100 aa) -> likely related; 15-25% identity -> twilight zone; below that -> likely unrelated

Practical Issues
Searching on DNA or protein level? In general, protein-encoding DNA should be translated!
DNA yields more random matches: 25% for DNA vs. 5% for proteins
DNA databases are larger and grow faster
Selection (generally) acts on protein level
Synonymous mutations are usually neutral
DNA sequence similarity decays faster

BLAST vs FASTA
Seeding:
BLAST integrates scoring matrix into first phase; FASTA requires exact matches (uses hashing)
BLAST increases search speed by finding fewer, but better, words during initial screening phase
FASTA uses shorter word sizes, so can be more sensitive
Results:
BLAST can return multiple best scoring alignments; FASTA returns only one final alignment

BLAST & FASTA References
FASTA - developed first
Pearson & Lipman (1988) Improved Tools for Biological Sequence Comparison. PNAS 85.
BLAST
Altschul, Gish, Miller, Myers, Lipman (1990) J. Mol. Biol. 215.
Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25.

BLAST Notes - & DP Alternatives
BLAST uses heuristics: it may miss some good matches
But, it's fast: X faster than Smith-Waterman (SW) DP
Large impact: NCBI's BLAST server handles more than 100,000 queries/day. Most used bioinformatics program in the world!
But - Xiong says: "It has been estimated that for some families of protein sequences BLAST can miss 30% of truly significant matches."
Increased availability of parallel processing has made DP-based approaches feasible
2 DP-based web servers, both more sensitive than BLAST:
Scan Protein Sequence: implements modified SW optimized for parallel processing
ParAlign: parallel SW or heuristics, www.paralign.org

NCBI - BLAST Programs, Glossary & Tutorials