Computational Biology, Part 9: Efficient database searching methods. Robert F. Murphy. Copyright © 1996, 1999, 2001. All rights reserved.



Efficient database searching methods
- Dynamic programming requires order N^2 L computations (where N is the size of the query sequence and L is the size of the database)
- Given the size of databases, more efficient methods are needed

“Hit and extend” sequence searching
- Problem: Too many calculations are “wasted” by comparing regions that have nothing in common
- Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical
- Basic method: Look for similar regions only near short stretches that match exactly

“Hit and extend” sequence searching
- We define a word size that is the minimum number of exact “letter” matches that must occur before we do any further comparison or alignment
- How do we find all of the occurrences of matching words between a sequence and a database?
  - Could scan sequence a word at a time, but this is of order L (size of database)

Word searching - hashing
- Solution: Use a precomputed table that lists where in the database each possible word occurs
  - Generation of the table is of order L (size of database) but use of the table is of order N (size of query sequence)
- The computer science term for this approach is hashing
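The table lookup described above can be sketched in a few lines of Python. This is only an illustration under assumed names: `build_word_index`, `find_hits`, and the tiny two-sequence database are hypothetical and do not come from the slides or from any particular search program.

```python
from collections import defaultdict

def build_word_index(database, k):
    """Map every length-k word to the (sequence id, offset) pairs where it occurs.

    Built once per database, so the cost is of order L.
    """
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index

def find_hits(query, index, k):
    """Look up each query word in the table; one lookup per query position."""
    hits = []
    for q_off in range(len(query) - k + 1):
        for seq_id, db_off in index.get(query[q_off:q_off + k], []):
            hits.append((q_off, seq_id, db_off))
    return hits

# Hypothetical toy database and query
database = {"seq1": "ACGTACGT", "seq2": "TTACGA"}
index = build_word_index(database, 3)
hits = find_hits("GACGT", index, 3)
```

Here `hits` lists the query offset, database sequence, and database offset of every shared 3-letter word. The query never scans the database itself, only the table.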

Hashing
- (Demonstration A9)

Database searching using words
References
- W. J. Wilbur and D. J. Lipman. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. U.S.A. 80 (1983)
- D. J. Lipman and W. R. Pearson. Rapid and sensitive protein similarity searches. Science 227 (1985) [FASTP]
- W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85 (1988) [FASTA]

FASTA
- Heavily used for searching databases until the advent of BLAST (see below)
- Inputs
  - k (word or ktuple) size
  - similarity matrix
- Compares query sequence pairwise with each sequence in the database

FASTA method
1. Find diagonals (paired pieces from each sequence without gaps) that have the highest density of common words
2. Rescore these using a scoring (similarity) matrix and trim ends that do not contribute to the highest score
  - Result: partial alignments without gaps
  - Reported as the “init1” score

FASTA method
3. Join regions together, including penalties for gaps
  - Result: unoptimized alignment with gaps
  - Reported as the “initn” score
4. Use dynamic programming in a band 32 residues wide around the best “initn” score
  - Result: optimized alignment with gaps
  - Reported as the “opt” score
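Step 1 of the method (finding the diagonals richest in common words) can be illustrated with a short sketch. It covers only that first step, not the rescoring, joining, or banded dynamic programming, and the function name and example sequences are hypothetical rather than taken from the FASTA program. A diagonal is identified here by the subject offset minus the query offset.

```python
from collections import Counter, defaultdict

def diagonal_word_counts(query, subject, k):
    """Count shared k-letter words on each diagonal (subject offset - query offset)."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], []):
            counts[j - i] += 1
    return counts

# Hypothetical example: the query is embedded in the subject at offset 2
counts = diagonal_word_counts("ACGTTGCA", "TTACGTTGCATT", 2)
best_diagonal, best_count = counts.most_common(1)[0]
```

In this example the densest diagonal is 2, with 7 shared words. A real implementation would keep several top diagonals and rescore them with a similarity matrix, as step 2 describes.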

Comments on FASTA
- Larger ktuple increases speed since fewer “hits” are found, but it also decreases sensitivity for finding similar but not identical sequences since exact matches of this length are required

Limitations of FASTA
- FASTA can miss significant similarity since
  - For proteins, similar sequences do not have to share identical residues
  - Asp-Lys-Val is quite similar to Glu-Arg-Ile yet it is missed even with ktuple size of 1 since no amino acid matches
  - Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly but there is no match with ktuple size of 2

Limitations of FASTA
- FASTA can miss significant similarity since
  - For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy where X’s are conserved and y’s are not
  - GGuUCuACgAAg and GGcUCcACaAAa both code for the same peptide sequence (Gly-Ser-Thr-Lys) but they don’t match with ktuple size of 3 or higher
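The examples on the two limitation slides can be checked directly by listing the words two sequences share. The helper below is a hypothetical sketch, with the peptides written in one-letter codes (D=Asp, K=Lys, V=Val, E=Glu, R=Arg, I=Ile, G=Gly).

```python
def shared_words(a, b, k):
    """Return the set of length-k words present in both sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return words_a & words_b

# Protein examples: no identical residue, and no identical pair of residues
protein_k1 = shared_words("DKV", "ERI", 1)
protein_k2 = shared_words("GDGKG", "GEGRG", 2)

# Nucleic acid example: same peptide, different third codon positions
rna_k3 = shared_words("GGUUCUACGAAG", "GGCUCCACAAAA", 3)
rna_k2 = shared_words("GGUUCUACGAAG", "GGCUCCACAAAA", 2)
```

All of `protein_k1`, `protein_k2`, and `rna_k3` come out empty, so an exact-word search at those ktuple sizes finds no hit. At ktuple size 2 the two RNA sequences do share words.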

BLAST (Basic Local Alignment Search Tool)
- Goal: find sequences from database similar to query sequence
- Previous tools use either
  - direct, theoretically sound but computationally slow approach to examine all possible alignments of query with database (dynamic programming)
  - indirect, heuristic but computationally fast approach to find similar sequences by first finding identical stretches (FASTP, FASTA)

BLAST (Basic Local Alignment Search Tool)
- BLAST combines the best of both by using a theoretically sound method that searches for similar sequences directly but is computationally fast
- Reference
  - S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215 (1990)

Global vs. Local Algorithms
- We distinguish
  - Global similarity algorithms, which optimize overall alignment between two sequences (dynamic programming)
  - Local similarity algorithms, which seek only relatively conserved pieces of sequence (FASTA, BLAST)

BLAST basics
- Need similarity measure, as in dynamic programming - use PAM-120 for proteins
- Define maximal segment pair (MSP) to be the highest scoring pair of identical length segments chosen from 2 sequences (in FASTA terms, highest init1 diagonal)

BLAST basics
- Define a segment pair to be locally maximal if its score cannot be improved either by extending or by shortening both segments
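To make the MSP definition concrete, the sketch below finds it by brute force: every diagonal is scanned and the best-scoring ungapped run on it is kept (a maximum-subarray scan). This is not how BLAST searches, which is the point of the slides that follow. The scoring function `toy_score` (+5 match, -4 mismatch) is an assumption standing in for PAM-120, and all names are hypothetical.

```python
def maximal_segment_pair(a, b, score):
    """Return (score, start in a, start in b, length) of the best ungapped segment pair."""
    best = (0, 0, 0, 0)
    # Diagonal d pairs a[i] with b[i + d].
    for d in range(-(len(a) - 1), len(b)):
        i = max(0, -d)
        run_score, run_start = 0, i
        while i < len(a) and i + d < len(b):
            if run_score <= 0:
                # A non-positive prefix never helps, so restart the segment here.
                run_score, run_start = 0, i
            run_score += score(a[i], b[i + d])
            if run_score > best[0]:
                best = (run_score, run_start, run_start + d, i - run_start + 1)
            i += 1
    return best

def toy_score(x, y):
    # Assumed stand-in for a real similarity matrix such as PAM-120
    return 5 if x == y else -4

msp = maximal_segment_pair("AACGTT", "CCACGTGG", toy_score)
```

For these two hypothetical sequences the MSP is the shared stretch ACGT: score 20, starting at offset 1 in the first sequence and offset 2 in the second, length 4. Extending or shortening it lowers the score, so it is also locally maximal.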

BLAST basics
- Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T
- Key concept: Seems similar to FASTA, but we are searching for words that score above T rather than words that match exactly

BLAST method for proteins
1. Compile a list of words which give a score above T when paired with the query sequence.
  - Example using PAM-120 for query sequence ACDE (w=4, T=17):
      ACDE paired with ACDE = 22
  - try all possibilities:
      AAAA = 0 no good
      AAAC = -7 no good
      ...too slow, try directed change

Generating word list
      ACDE paired with ACDE = 22
  - change 1st pos. to all acceptable substitutions
      gCDE = 20 ok (=pCDE, sCDE, tCDE)
      nCDE = 19 ok (=dCDE, eCDE, nCDE, vCDE)
      iCDE = 18 ok (=qCDE)
      kCDE = 17 ok (=mCDE)
  - change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13
  - change 3rd pos. in combination with first position
      gCnE = 17 ok
  - continue - use recursion

Generating word list
- For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence
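The "directed change" idea can be sketched as a recursion that abandons a partial word as soon as the remaining positions cannot lift its score to T. The sketch below is one way to write this, not the BLAST source. To stay self-contained it uses an assumed nucleotide alphabet and an assumed toy scoring function (+5 match, -4 mismatch) in place of PAM-120, and every name in it is hypothetical.

```python
def neighborhood_words(word, alphabet, score, threshold):
    """List (candidate, total) for every candidate scoring >= threshold against word."""
    # best_rest[i] = highest score still attainable from position i onward
    best_rest = [0] * (len(word) + 1)
    for i in range(len(word) - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(score(word[i], c) for c in alphabet)

    results = []

    def extend(prefix, total):
        pos = len(prefix)
        if pos == len(word):
            results.append((prefix, total))
            return
        for c in alphabet:
            new_total = total + score(word[pos], c)
            # Prune: even the best choices for the rest cannot reach the threshold.
            if new_total + best_rest[pos + 1] >= threshold:
                extend(prefix + c, new_total)

    extend("", 0)
    return results

def toy_score(x, y):
    # Assumed stand-in for a real similarity matrix such as PAM-120
    return 5 if x == y else -4

words = neighborhood_words("ACG", "ACGT", toy_score, 6)
```

With this toy scoring and T=6, the list holds the word itself (score 15) and the nine words that differ from it at exactly one position (score 6 each). Any word with two mismatches is pruned before it is fully built.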

BLAST method for proteins
2. Scan the database for hits with the compiled list of words. Two approaches:
  - Use index of all possible words (for w=4, need array of size 20^4 = 160,000). Can compress this index using pointers to save space.
  - Use finite state machine (actually used)
    - Calculate a state transition table that tells what state to go to based on the next character in the sequence
3. Extend hits to form segment pairs
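The array size of 20^4 in the first approach follows from treating each 4-letter word as a 4-digit number in base 20. A minimal sketch, assuming an arbitrary ordering of the 20 standard amino acids (the ordering and the names are illustrative, not taken from BLAST):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering; any fixed order works
RANK = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def word_to_index(word):
    """Interpret a protein word as a base-20 number, giving its slot in the array."""
    index = 0
    for aa in word:
        index = index * 20 + RANK[aa]
    return index

table_size = 20 ** 4                    # one slot per possible 4-letter word
first_slot = word_to_index("AAAA")
last_slot = word_to_index("YYYY")
acde_slot = word_to_index("ACDE")
```

Every 4-letter word maps to a distinct slot between 0 and 159,999, so checking whether a database word is on the compiled list is a single array lookup.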

BLAST Method for DNA
1. Make list of all contiguous w-mers in the query sequence (often w=12)
2. Compress database by packing 4 nucleotides into a single byte (use auxiliary table to tell you where sequences start and stop within the compressed database). This doesn't allow for unspecified bases (wildcards).

BLAST Method for DNA
3. Compress the w-mers from the query sequence the same way.
4. Search the compressed database for matches with the compressed w-mers
  - Since all frames of the query sequence are considered separately, any match of length w>=11 must contain a match of length 8 that lies on a byte boundary of one of the w-mers from the query sequence. Thus can scan a (packed) byte at a time, improving speed 4-fold over comparing one nucleotide at a time.
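The packing step and the byte-boundary argument can both be sketched briefly. The 2-bit codes assigned to each base and the helper names below are assumptions for illustration. The second function simply enumerates where a byte-aligned 8-mer (two packed bytes) can sit inside a match.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding; no wildcard possible

def pack(seq):
    """Pack 4 nucleotides per byte; this sketch requires len(seq) to be a multiple of 4."""
    if len(seq) % 4:
        raise ValueError("pad the sequence to a multiple of 4 first")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

def aligned_8mers(start, length):
    """Start positions of byte-aligned 8-mers lying inside a match at [start, start + length)."""
    first = (start + 3) // 4 * 4          # round start up to a byte boundary
    return list(range(first, start + length - 8 + 1, 4))

packed = pack("ACGTTTTT")
```

Here `packed` is two bytes long. Whatever offset a match of length 11 starts at, `aligned_8mers` returns at least one position, whereas a match of length 10 starting one base past a byte boundary contains none. This is why the guarantee holds from w=11 upward.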

BLAST Method for DNA
- Problem: if query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., Alu sequence) there will be many hits with "uninteresting" regions.

BLAST Method for DNA
- Solution:
  - During compression of the database, tabulate frequencies of all 8-tuples.
  - Make a list of those occurring very frequently (more frequently than expected by chance).
  - Remove these words from the query list of w-mers before searching database.
  - Remove words matching a sublibrary of repeated sequences (but report the matches to that sublibrary when done).
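The frequency filter can be sketched as follows. Several simplifications are assumptions of this sketch rather than details from the slides: the same word length k is used for counting and for the query words, "expected by chance" is taken to mean equal base frequencies, and the over-representation cutoff `factor` is a hypothetical parameter.

```python
from collections import Counter

def frequent_words(database_seqs, k, factor):
    """Words seen more than `factor` times the count expected under uniform base usage."""
    counts = Counter()
    for seq in database_seqs:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    expected = sum(counts.values()) / 4 ** k   # assumes equal base frequencies
    return {w for w, c in counts.items() if c > factor * expected}

def filter_query_words(query, k, blocked):
    """Keep (offset, word) for query words that are not on the blocked list."""
    return [(i, query[i:i + k])
            for i in range(len(query) - k + 1)
            if query[i:i + k] not in blocked]

# Hypothetical A-T rich toy database, with k=2 to keep the example small
blocked = frequent_words(["ATATATATAT", "GCATTA"], 2, 4)
kept = filter_query_words("GCATAG", 2, blocked)
```

In this toy case the over-represented words are AT and TA, so only the query words GC, CA, and AG are searched. The repeat sublibrary mentioned in the last bullet would be handled separately and is not part of this sketch.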

BLAST Statistical significance
- A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of maximal segment pairs (MSPs) given w and T
- This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability

Summary of Database Search Methods