Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, 2001. All rights reserved.

Similar presentations


Presentation on theme: "Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, 2001. All rights reserved."— Presentation transcript:

1 Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, 2001. All rights reserved.

2 Efficient database searching methods Dynamic programming requires order N 2 L computations (where N is size of the query sequence and L is the size of the database) Dynamic programming requires order N 2 L computations (where N is size of the query sequence and L is the size of the database) Given size of databases, more efficient methods needed Given size of databases, more efficient methods needed

3 “Hit and extend” sequence searching Problem: Too many calculations “wasted” by comparing regions that have nothing in common Problem: Too many calculations “wasted” by comparing regions that have nothing in common Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical Basic method: Look for similar regions only near short stretches that match exactly Basic method: Look for similar regions only near short stretches that match exactly

4 “Hit and extend” sequence searching We define a word size that is the minimum number of exact “letter” matches that must occur before we do any further comparison or alignment We define a word size that is the minimum number of exact “letter” matches that must occur before we do any further comparison or alignment How do we find all of the occurences of matching words between a sequence and a database? How do we find all of the occurences of matching words between a sequence and a database?  Could scan sequence a word at a time, but this is order L (size of database)

5 Word searching - hashing Solution: Use a precomputed table that lists where in the database each possible word occurs Solution: Use a precomputed table that lists where in the database each possible word occurs  Generation of the table is of order L (size of database) but use of the table is of order N (size of query sequence) The computer science term for this approach is hashing The computer science term for this approach is hashing

6 Hashing (Demonstration A9) (Demonstration A9)

7 Database searching using words References W. J. Wilbur and D. J. Lipman. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. U.S.A. 80:726-730 (1983) W. J. Wilbur and D. J. Lipman. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. U.S.A. 80:726-730 (1983) D. J. Lipman and W. R. Pearson. Rapid and sensitive protein similarity searches. Science 227:1435-1441 (1985) [FASTP] D. J. Lipman and W. R. Pearson. Rapid and sensitive protein similarity searches. Science 227:1435-1441 (1985) [FASTP] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444-2448 (1988) [FASTA] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444-2448 (1988) [FASTA]

8 FASTA Heavily used for searching databases until advent of BLAST (see below) Heavily used for searching databases until advent of BLAST (see below) Inputs Inputs  k (word or ktuple) size  similarity matrix Compares query sequence pairwise with each sequence in the database Compares query sequence pairwise with each sequence in the database

9 FASTA method 1. Find diagonals (paired pieces from each sequence without gaps) that have the highest density of common words 2. Rescore these using a scoring (similarity) matrix and trim ends that do not contribute to the highest score  Result: partial alignments without gaps  Reported as the “init1” score

10 FASTA method 3. Join regions together, including penalties for gaps  Result: unoptimized alignment with gaps  Reported as the “initn” score 4. Use dynamic programming in a band 32 residues wide around the best “initn” score  Result: optimized alignment with gaps  Reported as the “opt” score

11 Comments on FASTA Larger ktuple increases speed since fewer “hits” are found but it also decreases sensitivity for finding similar but not identical sequences since exact matches of this length are required Larger ktuple increases speed since fewer “hits” are found but it also decreases sensitivity for finding similar but not identical sequences since exact matches of this length are required

12 Limitations of FASTA FASTA can miss significant similarity since FASTA can miss significant similarity since  For proteins, similar sequences do not have to share identical residues  Asp-Lys-Val is quite similar to Glu-Arg-Ile yet it is missed even with ktuple size of 1 since no amino acid matches  Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly but there is match with ktuple size of 2

13 Limitations of FASTA FASTA can miss significant similarity since FASTA can miss significant similarity since  For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy where X’s are conserved and y’s are not  GGuUCuACgAAg and GGcUCcACaAAA both code for the same peptide sequence (Gly-Ser- Thr-Lys) but they don’t match with ktuple size of 3 or higher

14 BLAST (Basic Local Alignment Search Tool) Goal: find sequences from database similar to query sequence Goal: find sequences from database similar to query sequence Previous tools use either Previous tools use either  direct, theoretically sound but computationally slow approach to examine all possible alignments of query with database (dynamic programming)  indirect, heuristic but computationally fast approach to find similar sequences by first finding identical stretches (FASTP, FASTA)

15 BLAST (Basic Local Alignment Search Tool) BLAST combines best of both by using theoretically sound method which searches for similar sequences directly but computationally fast BLAST combines best of both by using theoretically sound method which searches for similar sequences directly but computationally fast Reference Reference  S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215:403- 410 (1990)

16 Global vs. Local Algorithms We distinguish We distinguish  Global similar algorithms which optimize overall alignment between two sequences (dynamic programming)  Local similar algorithms which see only relatively conserved pieces of sequence (FASTA, BLAST)

17 BLAST basics Need similarity measure, as in dynamic programming - use PAM-120 for proteins Need similarity measure, as in dynamic programming - use PAM-120 for proteins Define maximal segment pair (MSP) to be the highest scoring pair of identical length segments chosen from 2 sequences (in FASTA terms, highest init1 diagonal) Define maximal segment pair (MSP) to be the highest scoring pair of identical length segments chosen from 2 sequences (in FASTA terms, highest init1 diagonal)

18 BLAST basics Define a segment pair to be locally maximal if its score cannot be improved either by extending or by shortening both segments Define a segment pair to be locally maximal if its score cannot be improved either by extending or by shortening both segments

19 BLAST basics Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T Key concept: Seems similar to FASTA, but we are searching for words which score above T rather than that match exactly Key concept: Seems similar to FASTA, but we are searching for words which score above T rather than that match exactly

20 BLAST method for proteins 1. Compile a list of words which give a score above T when paired with the query sequence.  Example using PAM-120 for query sequence ACDE (w=4, T=17): A C D E A C D E ACDE = +3 +9 +5 +5 = 22  try all possibilities: AAAA = +3 -3 0 0 = 0 no good AAAC = +3 -3 0 -7 = -7 no good ...too slow, try directed change

21 Generating word list A C D E A C D E ACDE = +3 +9 +5 +5 = 22  change 1st pos. to all acceptable substitutions gCDE = 1 9 5 5 = 20 ok (=pCDE,sCDE, tCDE) tCDE) nCDE = 0 9 5 5 = 19 ok (=dCDE,eCDE, nCDE,vCDE) nCDE,vCDE) iCDE = -1 9 5 5 = 18 ok (=qCDE) kCDE = -2 9 5 5 = 17 ok (=mCDE)  change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13  change 3rd pos. in combination with first position gCnE = 1 9 2 5 = 17 ok  continue - use recursion

22 Generating word list For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence

23 BLAST method for proteins 2. Scan the database for hits with the compiled list of words. Two approaches:  Use index of all possible words (for w=4, need array of size 20 4 =160,000. Can compress this index using pointers to save space.  Use finite state machine (actually used)  Calculate a state transition table that tells what state to go to based on the next character in the sequence 3. Extend hits to form segment pairs

24 BLAST Method for DNA 1. Make list of all contiguous w-mers in the query sequence (often w=12) 1. Make list of all contiguous w-mers in the query sequence (often w=12) 2. Compress database by packing 4 nucleotides into a single byte (use auxiliary table to tell you where sequences start and stop within the compressed database) -- doesn't allow for unspecified bases (wildcards) 2. Compress database by packing 4 nucleotides into a single byte (use auxiliary table to tell you where sequences start and stop within the compressed database) -- doesn't allow for unspecified bases (wildcards)

25 BLAST Method for DNA 3. Compress the w-mers from the query sequence the same way. 3. Compress the w-mers from the query sequence the same way. 4. Search the compressed database for matches with the compressed w-mers 4. Search the compressed database for matches with the compressed w-mers  Since all frames of the query sequence are considered separately, any match of length w>=11 must contain a match of length 8 that lies on a byte boundary of one of the w-mers from the query sequence. Thus can scan a (packed) byte at a time, improving speed 4-fold over comparing one nucleotide at a time.

26 BLAST Method for DNA Problem: if query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., Alu sequence) there will be many hits with "uninteresting" regions. Problem: if query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., Alu sequence) there will be many hits with "uninteresting" regions.

27 BLAST Method for DNA Solution: Solution:  During compression of the database, tabulate frequencies of all 8-tuples.  Make a list of those occurring very frequently (more frequently than expected by chance).  Remove these words from the query list of w- mers before searching database.  Remove words matching a sublibrary of repeated sequences (but report the matches to that sublibrary when done).

28 BLAST Statistical significance A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability

29 Summary of Database Search Methods


Download ppt "Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, 2001. All rights reserved."

Similar presentations


Ads by Google