1
Sequence Similarity Search H.C.Huang @ 2005/10/13 2005 Autumn / YM / Bioinformatics
2
Outline: review; the BLAST database search algorithm; statistical significance of results.
3
References: D.W. Mount, Bioinformatics, Ch. 6, pp. 228-259; slides from UW Genomic Informatics by W.S. Noble; slides by Prof. C.H. Chang.
4
PA Review Scoring a pairwise alignment requires a substitution matrix and gap penalties. Dynamic programming is an efficient algorithm for finding the optimal alignment. Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions. DP iteratively fills in the matrix using a simple mathematical rule. Local alignment finds the best match between subsequences.
5
PA Review Smith-Waterman local alignment algorithm: no score is negative; trace back from the largest score in the matrix. Substitution matrices represent the probabilities of mutations (PAM, BLOSUM62). Affine gap penalties combine a large gap opening penalty with a small gap extension penalty.
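As a refresher on how the DP fill works, here is a minimal Smith-Waterman sketch in Python. The simple match/mismatch scores and linear gap penalty are simplifications standing in for a real substitution matrix and affine gaps, and the example sequences are arbitrary.

```python
# A minimal Smith-Waterman sketch: simple match/mismatch scores and a linear
# gap penalty stand in for a substitution matrix and affine gaps.
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    n, m = len(query), len(target)
    # H[i][j] = best score of a local alignment ending at query[i-1], target[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best_score, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(0,                     # no score is negative
                          H[i - 1][j - 1] + s,   # align the two residues
                          H[i - 1][j] + gap,     # gap in the target
                          H[i][j - 1] + gap)     # gap in the query
            if H[i][j] > best_score:             # traceback would start from
                best_score, best_cell = H[i][j], (i, j)   # the largest score
    return best_score, best_cell

# Arbitrary example sequences
print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```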
6
MSA Review Global multiple sequence alignment: simultaneous comparison of many sequences often finds similarities that are invisible in PA. Progressive method: use pairwise alignment to iteratively add one sequence to a growing MSA (ClustalW).
7
ClustalW Procedure “Once a gap, always a gap”
8
Origin of Sequence Similarity Evolution: similar sequences derive from the same ancestral sequence, diverging through mutations.
9
Sequence Alignment vs. Similarity Searching Similarity searching: searching for homologies to elucidate the function of an unknown protein; it produces alignments, but the desired result is the score. Sequence alignment: searching for the consensus sequence; it produces a score, but the desired result is the sequence.
10
Database searching: a query sequence is compared against a sequence database by a sequence comparison algorithm, which returns the targets ranked by score.
11
How long does DP take? Consider the dynamic programming matrix for a target sequence of length m and a query sequence of length n.
12
How long does DP take? In the dynamic programming matrix for a target sequence of length m and a query sequence of length n, there are nm entries. Each entry requires a constant number c of operations, so the total number of required operations is approximately nmc. We say that the algorithm is “order nm” or “O(nm).”
13
How long does DP take? Say that your query is 200 amino acids long. You are searching the non-redundant database, which currently contains about a million proteins. If their average length is 200, then you have to fill in 200 × 200 × 1,000,000 = 4 × 10^10 DP entries. If it takes only 10 operations to fill in each cell, then you still have to do 4 × 10^11 floating point operations!
14
BLAST DP is O(nm); BLAST is O(m). Fundamental innovation: employ a data structure to index the query sequence. The data structure allows you to look up entries in a table in O(1) time. Does my length-n sequence contain the subsequence “GTR”? Naive method: scan the sequence, O(n). Improved method: hash table or search tree lookup, O(1).
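A toy illustration of the lookup idea; the sequence and the word being looked up are made up for the example.

```python
sequence = "ACGTRGGTACGTRA"          # hypothetical length-n sequence
word = "GTR"

# Naive method: scan the sequence, O(n)
found_by_scan = any(sequence[i:i + 3] == word for i in range(len(sequence) - 2))

# Improved method: index every 3-letter word once, then answer lookups in O(1)
index = {sequence[i:i + 3] for i in range(len(sequence) - 2)}
found_by_hash = word in index

print(found_by_scan, found_by_hash)   # True True
```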
15
Most-Cited Papers, 1983-2002 (BLAST: most cited paper of the 1990s)
1. P. Chomczynski, N. Sacchi, "Single-step method of RNA isolation by acid guanidinium thiocyanate phenol chloroform extraction," Analyt. Biochem., 162(1): 156-9, 1987. 49,562 citations.
2. A.P. Feinberg, B. Vogelstein, "A technique for radiolabeling DNA restriction endonuclease fragments to high specific activity," Analyt. Biochem., 132(1): 6-13, 1983. 20,609 citations.
3. S.F. Altschul, et al., "Basic Local Alignment Search Tool," J. Molec. Biol., 215(3): 403-10, 1990. 15,306 citations.
4. G. Grynkiewicz, M. Poenie, R.Y. Tsien, "A new generation of Ca2+ indicators with greatly improved fluorescence properties," J. Biol. Chem., 260(6): 3440-50, 1985. 14,357 citations.
5. J. Devereux, P. Haeberli, O. Smithies, "A comprehensive set of sequence-analysis programs for the VAX," Nucleic Acids Res., 12(1): 387-95, 1984. 13,056 citations.
SOURCE: Thomson ISI Web of Science, http://www.sciencewatch.com/sept-oct2003/sw_sept-oct2003_page1.htm
16
O’Reilly Book
17
BLAST algorithm
1. Remove low-complexity regions.
2. Make a list of all words of length 3 amino acids or 11 nucleotides.
3. Augment the list to include similar words.
4. Store the list in a search tree.
5. Scan the database for occurrences of the words in the search tree.
6. Connect nearby occurrences.
7. Extend the matches.
8. Prune the list of matches using a score threshold.
9. Evaluate the significance of each remaining match.
10. Perform Smith-Waterman to get an alignment.
See pp. 248-253 for details.
18
1: Sequence filtering Low complexity sequences yield false positives. Therefore, replace these regions with Xs.
19
Example of repeats
20
1: Sequence filtering Low complexity sequences yield false positives. Therefore, replace these regions with Xs. Complexity of a window: K = (1/L) · log_N(L! / (n_1! n_2! … n_N!)), where L is the window length, N is the alphabet size (4 or 20), and n_i is the frequency of the i-th letter in the window.
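The symbols above fit the Wootton-Federhen complexity measure commonly used for low-complexity filtering; here is a minimal sketch under that assumption.

```python
from math import factorial, log
from collections import Counter

def complexity(window, alphabet_size=20):
    """Wootton-Federhen complexity: K = (1/L) * log_N( L! / prod_i n_i! ),
    where L is the window length, N the alphabet size (4 or 20),
    and n_i the count of the i-th letter in the window."""
    L = len(window)
    states = factorial(L)
    for n_i in Counter(window).values():
        states //= factorial(n_i)          # exact integer division at each step
    return log(states, alphabet_size) / L

print(complexity("PPPPPPPPPP"))   # 0.0: low complexity, would be masked with Xs
print(complexity("YGGFMTSEKS"))   # ~0.46: varied 10-residue window, kept
```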
21
2: List all words in query YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ YGG GGF GFM FMT MTS TSE SEK …
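A sketch of the word-listing step, using the query sequence shown above.

```python
query = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"
W = 3   # word length for proteins (11 for nucleotides)

# Every overlapping word of length W, paired with its starting position (0-based)
query_words = [(query[i:i + W], i) for i in range(len(query) - W + 1)]
print(query_words[:7])   # [('YGG', 0), ('GGF', 1), ('GFM', 2), ('FMT', 3), ...]
```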
22
3: Augment word list YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ YGG GGF GFM FMT MTS TSE SEK … AAA AAB AAC … YYY
23
3: Augment word list. BLOSUM62 scores: GGF vs. AAA = 0 + 0 + (-2) = -2 (non-match).
24
3: Augment word list. BLOSUM62 scores: GGF vs. AAA = 0 + 0 + (-2) = -2 (non-match); GGF vs. GGY = 6 + 6 + 3 = 15 (match). A user-specified threshold determines which three-letter words are considered matches and non-matches.
25
3: Augment word list YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ YGG GGF GFM FMT MTS TSE SEK … GGI GGL GGM GGF GGW GGY …
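A sketch of the augmentation step. It assumes Biopython is available for the BLOSUM62 matrix, and uses T = 11, the word threshold listed on the BLAST parameters slide; the exact threshold is user-specified.

```python
from itertools import product
from Bio.Align import substitution_matrices  # Biopython (assumed available)

blosum62 = substitution_matrices.load("BLOSUM62")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
T = 11  # neighborhood word threshold; value taken from the parameters slide

def neighborhood(word):
    """All 3-letter words scoring at least T against `word` under BLOSUM62."""
    words = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(blosum62[a, b] for a, b in zip(word, candidate))
        if score >= T:
            words.append("".join(candidate))
    return words

# The augmented list contains the query word itself plus its high-scoring neighbors.
print(neighborhood("GGF"))  # includes GGF, GGI, GGL, GGM, GGW, GGY, ...
```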
26
4: Store words in search tree. The augmented list of query words is stored in a search tree, so the question “Does this query contain GGF?” can be answered (“Yes, at position 2.”) in O(1) time.
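A Python dict can stand in for the search tree in a sketch; it is not BLAST's actual data structure, but it gives the same constant-time lookup from word to query position.

```python
from collections import defaultdict

query = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"
W = 3

# Build the index once: word -> positions in the query where it occurs.
word_index = defaultdict(list)
for i in range(len(query) - W + 1):
    word_index[query[i:i + W]].append(i)
# The augmented neighborhood words would be added here too, pointing back to
# the query position that generated them, e.g. word_index["GGY"].append(1).

# "Does this query contain GGF?" -> True [1] (0-based; the slide counts from 1)
print("GGF" in word_index, word_index["GGF"])
```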
27
Search tree. Example branch: G → G → {L, M, F, W, Y}.
28
5: Scan the database. Each word of the database sequence that appears in the query word list produces a hit (an x in the query-versus-database plot).
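A sketch of the scan using the dict index from the previous step; the database sequence below is made up, and the augmented neighborhood words are omitted for brevity.

```python
query = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"
database_seq = "MTAGGFMTSEKAQTPLVTLYGGFLRRIRPKLK"   # hypothetical target
W = 3

# Index the query words (neighborhood words omitted for brevity).
word_index = {}
for i in range(len(query) - W + 1):
    word_index.setdefault(query[i:i + W], []).append(i)

# Slide a window along the database sequence; every word already in the
# index becomes a hit (an "x" in the dot plot on the slide).
hits = []   # (query_position, database_position) pairs
for j in range(len(database_seq) - W + 1):
    for i in word_index.get(database_seq[j:j + W], []):
        hits.append((i, j))

print(hits)   # note the runs of hits sharing the same diagonal (i - j constant)
```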
29
6: Connect diagonal matches. Two x’s (word hits) are connected if and only if they lie on the same diagonal and are less than A letters apart.
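One way to sketch the connection rule: group hits by diagonal (query position minus database position) and link consecutive hits on the same diagonal that are fewer than A letters apart. The cutoff A and the hit list are illustrative values only.

```python
from collections import defaultdict

A = 10   # maximum allowed separation on a diagonal (illustrative)
hits = [(1, 3), (2, 4), (3, 5), (4, 6), (10, 12), (0, 19), (1, 20)]  # (query_pos, db_pos)

# Hits lie on the same diagonal when query_pos - db_pos is equal.
by_diagonal = defaultdict(list)
for i, j in hits:
    by_diagonal[i - j].append((i, j))

connected = []
for diag_hits in by_diagonal.values():
    diag_hits.sort()
    run = [diag_hits[0]]
    for hit in diag_hits[1:]:
        if hit[0] - run[-1][0] < A:   # close enough: connect to the current run
            run.append(hit)
        else:                         # too far apart: start a new run
            connected.append(run)
            run = [hit]
    connected.append(run)

print(connected)
```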
30
7: Extend matches. Each match is extended to the left and right until a negative BLOSUM62 score is encountered.
Query:    L P P Q G L L
Database: M P P E G L L
BLOSUM62 column scores: 2 7 7 2 6 4 4
Word (PQG vs. PEG) score = 7 + 2 + 6 = 15; extended HSP score = 2 + 7 + 7 + 2 + 6 + 4 + 4 = 32.
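A sketch of ungapped extension using the slide's stop rule (stop at the first negative-scoring column), assuming Biopython's BLOSUM62 matrix, and run on the slide's example.

```python
from Bio.Align import substitution_matrices  # Biopython (assumed available)

blosum62 = substitution_matrices.load("BLOSUM62")

def extend_hit(query, target, q_start, t_start, w=3):
    """Extend a length-w word hit left and right, one column at a time,
    stopping (per the slide's rule) at the first negative-scoring column."""
    score = sum(blosum62[a, b] for a, b in zip(query[q_start:q_start + w],
                                               target[t_start:t_start + w]))
    left = right = 0
    # extend to the left
    while q_start - left - 1 >= 0 and t_start - left - 1 >= 0:
        s = blosum62[query[q_start - left - 1], target[t_start - left - 1]]
        if s < 0:
            break
        score += s
        left += 1
    # extend to the right
    while q_start + w + right < len(query) and t_start + w + right < len(target):
        s = blosum62[query[q_start + w + right], target[t_start + w + right]]
        if s < 0:
            break
        score += s
        right += 1
    return score, (q_start - left, t_start - left, w + left + right)

# The slide's example: the query word PQG hit the neighborhood word PEG in the target.
print(extend_hit("LPPQGLL", "MPPEGLL", 2, 2))   # word score 15 extends to HSP score 32
```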
31
8: Prune matches Discard all matches that score below some threshold S.
32
9: Evaluate significance BLAST uses an analytical statistical significance calculation, which we will learn later.
33
10: Smith-Waterman Significant matches are re-analyzed using the Smith-Waterman dynamic programming algorithm. The alignments reported by BLAST are produced by dynamic programming.
34
BLAST parameters
–Filter sequences or not (yes)
–Word length (3)
–Substitution matrix (BLOSUM62)
–Word threshold (11)
–Diagonal distance (?)
–Match score threshold (?)
–Gap open and extension penalties (11, 1)
35
Summary (I) Dynamic programming is O(nm), where n is the length of the query and m is the size of the database. BLAST is O(m). BLAST produces an index of the query sequence that allows fast matching to the database.
36
Review: BLAST. The query sequence is turned into a list of words in the query and similar words, to be scanned against each target sequence.
37
BLAST. For each word of the target sequence, ask: “Does this target word appear in the query word list?”
38
BLAST. “Yes, at position 34 in the query sequence.” The hit is recorded (an x) at that query/target position.
39
BLAST. Scanning the rest of the target sequence yields a set of word hits (x’s) in the query-versus-target plot.
40
BLAST. These two hits are on the diagonal and close to each other, so let’s try to connect them.
41
BLAST. The nearby diagonal hits are connected and extended.
42
BLAST. Assign an E-value to each hit (e.g., 0.005 and 0.27 in the plot).
43
BLAST “The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.” The initial word threshold T is the most important parameter: a low T gives high sensitivity but long compute times, while a high T gives lower sensitivity but fast computation.
44
Statistics Review What is a distribution?
45
Review What is a distribution? –A plot showing the frequency of a given variable or observation.
46
Review What is a null hypothesis?
47
Review What is a null hypothesis? –A statistician’s way of characterizing “chance.” –Generally, a mathematical model of randomness with respect to a particular set of observations. –The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.
48
Review Examples of null hypotheses: –Sequence comparison using shuffled sequences. –A normal distribution of log ratios from a microarray experiment. –LOD scores from genetic linkage analysis when the relevant loci are randomly sprinkled throughout the genome.
49
Empirical score distribution The picture shows a distribution of scores from a real database search using BLAST. This distribution contains scores from both non-homologous and homologous pairs; the high scores come from homologous pairs.
50
Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database.
51
Review What is a p-value?
52
Review What is a p-value? –The probability of observing an effect as strong or stronger than you observed, given the null hypothesis. I.e., “How likely is this effect to occur by chance?” –Pr(x > S|null)
53
Review If BLAST returns a score of 75, how would you compute the corresponding p-value?
54
Review If BLAST returns a score of 75, how would you compute the corresponding p-value? –First, compute many BLAST scores using random queries and a random database. –Summarize those scores into a distribution. –Compute the area of the distribution to the right of the observed score (more details to come).
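A sketch of that recipe; the null scores here are made-up random numbers standing in for scores computed from shuffled queries and databases.

```python
import random

random.seed(0)
# Stand-ins for "many BLAST scores using random queries and a random database":
# here they are just Gaussian random numbers for illustration.
null_scores = [random.gauss(30, 10) for _ in range(10000)]

observed = 75
# Empirical p-value: the fraction of null scores at least as large as the observation.
p_value = sum(s >= observed for s in null_scores) / len(null_scores)
print(p_value)
# With a finite sample, extreme p-values bottom out at 0, which is one reason
# to fit an analytical distribution to the null scores instead (next slides).
```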
55
Review What is the name of the distribution created by sequence similarity scores, and what does it look like?
56
Review What is the name of the distribution created by sequence similarity scores, and what does it look like? –Extreme value distribution, or Gumbel distribution. –It looks similar to a normal distribution, but it has a larger tail on the right.
57
Exponential distribution
58
Normal distribution
59
Extreme value distribution This distribution is characterized by a larger tail on the right.
60
Extreme value distribution. The distribution: P(X ≤ x) = exp[-e^(-x)]. The area to the right of S: P(X > S) = 1 - exp[-e^(-S)]. Scaling to a particular type of score: P(X > S) = 1 - exp[-e^(-λ(S-μ))], where μ is the mode and λ is a scale factor.
61
Extreme value distribution (formulas as above). Exercise: compute the value of the distribution function exp[-e^(-x)] for x = 0.
62
Extreme value distribution (formulas as above). Exercise: compute the value of the distribution function exp[-e^(-x)] for x = 0. Solution: exp[-e^0] = exp[-1] = 0.368.
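The right-tail formula as a small function; μ and λ are the fitted EVD parameters, and the values used below match the worked examples that follow.

```python
from math import exp

def evd_pvalue(score, mu, lam):
    """Right tail of a fitted extreme value distribution:
    P(X > score) = 1 - exp(-e^(-lam * (score - mu)))."""
    return 1.0 - exp(-exp(-lam * (score - mu)))

# x = 0 for the unscaled EVD: the distribution function is exp(-1) = 0.368,
# so the area to the right is 1 - 0.368 = 0.632.
print(evd_pvalue(0, 0, 1))
print(evd_pvalue(45, 25, 0.693))   # first worked example below: about 1e-6
print(evd_pvalue(23, 20, 0.744))   # second worked example below: about 0.1
```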
63
An example You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = 0.693. What is the p-value associated with 45?
64
An example (continued): with μ = 25, λ = 0.693, and a score of 45, p = 1 - exp[-e^(-0.693 × (45 - 25))] = 1 - exp[-e^(-13.86)] ≈ 9.6 × 10^(-7), roughly 10^(-6).
65
Another example You run BLAST and get a score of 23. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 20 and λ = 0.744. What is the p-value associated with 23?
66
Another example (continued): with μ = 20, λ = 0.744, and a score of 23, p = 1 - exp[-e^(-0.744 × (23 - 20))] = 1 - exp[-e^(-2.232)] ≈ 0.10.
67
What p-value is significant?
68
The most common thresholds are 0.01 and 0.05. A threshold of 0.05 means you are 95% sure that the result is significant. Is 95% enough? It depends upon the cost associated with making a mistake. Examples of costs: –Doing expensive wet lab validation. –Making clinical treatment decisions. –Misleading the scientific community. Most sequence analysis uses more stringent thresholds because the p-values are not very accurate.
69
Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assume that all of the observations are explainable by the null hypothesis. What is the chance that at least one of the observations will receive a p-value less than 0.05?
70
Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05? Pr(making a mistake) = 0.05. Pr(not making a mistake) = 0.95. Pr(not making any mistake) = 0.95^20 = 0.358. Pr(making at least one mistake) = 1 - 0.358 = 0.642. There is a 64.2% chance of making at least one mistake.
71
Bonferroni correction Assume that individual tests are independent. (Is this a reasonable assumption?) Divide the desired p-value threshold by the number of tests performed. For the previous example, 0.05 / 20 = 0.0025. Pr(making a mistake) = 0.0025. Pr(not making a mistake) = 0.9975. Pr(not making any mistake) = 0.9975^20 = 0.9512. Pr(making at least one mistake) = 1 - 0.9512 = 0.0488.
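The same two calculations in a few lines of Python.

```python
n_tests = 20
alpha = 0.05

# Without correction: chance of at least one false positive among 20 null tests
p_any_mistake = 1 - (1 - alpha) ** n_tests
print(round(p_any_mistake, 3))            # 0.642

# With Bonferroni correction: test each observation at alpha / n
bonferroni_alpha = alpha / n_tests        # 0.0025
p_any_mistake_corrected = 1 - (1 - bonferroni_alpha) ** n_tests
print(round(p_any_mistake_corrected, 4))  # 0.0488
```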
72
Database searching Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?
73
Database searching Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use? Say that you want to use a conservative p-value of 0.001. Recall that you would observe such a p-value by chance approximately once in every 1,000 comparisons against a random database. A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10^(-9).
74
E-values A p-value is the probability of making a mistake. The E-value is a version of the p-value that is corrected for multiple tests; it is essentially the converse of the Bonferroni correction. The E-value is computed by multiplying the p-value by the size of the database. The E-value is the expected number of times that the given score would appear in a random database of the given size. Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000.
75
E-value vs. Bonferroni Among n repetitions of a test, you observe a particular p-value p, and you want a significance threshold α. Bonferroni: divide the significance threshold α by n, i.e., require p < α/n. E-value: multiply the p-value by n, i.e., require pn < α. The two criteria are equivalent. * BLAST actually calculates E-values in a slightly more complex way.
79
Summary (II) Selecting a significance threshold requires evaluating the cost of making a mistake. Bonferroni correction: Divide the desired p-value threshold by the number of statistical tests performed. The E-value is the expected number of times that the given score would appear in a random database of the given size.
80
Position-specific iterated BLAST (PSI-BLAST): the query is BLASTed against the sequence database, the homologs found are used to build a statistical model of the protein family, and that model is used to search the database again.
81
PSI-BLAST pseudocode
Convert query to PSSM
do {
  BLAST database with PSSM
  Stop if no new homologs are found
  Add new homologs to PSSM
}
Print current set of homologs
82
PSI-BLAST pseudocode (annotated): PSSM stands for position-specific scoring matrix.
83
PSI-BLAST pseudocode (annotated): deciding which database matches count as new homologs requires a user-defined threshold.
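A toy sketch of the "add homologs to the PSSM" idea: building a simple log-odds position-specific scoring matrix from aligned sequences. The scoring scheme (uniform background, simple pseudocounts) and the example homologs are assumptions for illustration, far simpler than what PSI-BLAST actually does.

```python
from collections import Counter
from math import log2

def build_pssm(aligned_seqs, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Toy PSSM: log2(observed / background) for each column of the alignment.
    Uniform background and simple pseudocounts; real PSI-BLAST weighting is
    considerably more sophisticated."""
    background = 1.0 / len(alphabet)
    total = len(aligned_seqs) + pseudocount * len(alphabet)
    pssm = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(seq[col] for seq in aligned_seqs)
        pssm.append({a: log2((counts[a] + pseudocount) / total / background)
                     for a in alphabet})
    return pssm

# Query plus two made-up homologs from a previous iteration, already aligned
# to the query with no gaps for simplicity.
homologs = ["YGGFM", "YGGFL", "FGGFM"]
pssm = build_pssm(homologs)
print(round(pssm[1]["G"], 2))   # about 1.8: G is strongly favored at this column
```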
84
Summary (III) PSI-BLAST iterates BLAST, adding new homologs at each iteration.