Sequence Similarity Search 2005/10/13 2005 Autumn / YM / Bioinformatics.

Presentation transcript:

Sequence Similarity Search 2005/10/13, 2005 Autumn / YM / Bioinformatics

Outline
• Review
• BLAST
• Database search algorithm
• Result statistical significance

Reference
• D.W. Mount / Bioinformatics Ch.6 pp
• Slides from: UW / Genomic Informatics / W.S. Noble
• Reference: Prof. C.H. Chang's slides

PA (pairwise alignment) Review
• Scoring a pairwise alignment requires a substitution matrix and gap penalties.
• Dynamic programming is an efficient algorithm for finding the optimal alignment.
• Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions.
• DP iteratively fills in the matrix using a simple mathematical rule.
• Local alignment finds the best match between subsequences.

PA Review
• Smith-Waterman local alignment algorithm:
  – No score is negative.
  – Trace back from the largest score in the matrix.
• Substitution matrices represent the probability of mutations.
  – PAM / BLOSUM(62)
• Affine gap penalties include a large gap opening penalty and a small gap extension penalty.
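The Smith-Waterman rules reviewed above (no negative scores, best score taken from anywhere in the matrix) can be sketched in a few lines of Python. This is a minimal illustration only: it uses a simple match/mismatch score and a linear gap penalty instead of a substitution matrix and affine gaps, and the function name and default parameter values are assumptions made for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Simplifications (assumptions for this sketch): match/mismatch scoring
    instead of PAM/BLOSUM, and a linear rather than affine gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # no score is negative
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])  # traceback would start at this cell
    return best
```

Filling the matrix touches every (i, j) cell once, which is where the O(nm) cost discussed later comes from.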

MSA Review
• Global multiple sequence alignment
  – Simultaneous comparison of many sequences often finds similarities that are invisible in PA.
• Progressive Method
  – Use pairwise alignment to iteratively add one sequence to a growing MSA
  – ClustalW

ClustalW Procedure “Once a gap, always a gap”

Origin of Sequence Similarity
• Evolution
  – Similar sequences descend from the same ancestral sequence and diverge through mutations

Sequence Alignment vs. Similarity Searching
• Similarity searching
  – Searching for homologies to elucidate the function of an unknown protein
  – Produces alignments, but the desired result is the score
• Sequence alignment
  – Searching for consensus sequence
  – Produces a score, but the desired result is the sequence

Database searching (figure: a query and a sequence database are passed to a sequence comparison algorithm, which returns targets ranked by score)

How long does DP take?
(Figure: dynamic programming matrix, with a target sequence of length m on one axis and a query sequence of length n on the other.)
There are nm entries in the matrix. Each entry requires a constant number c of operations. The total number of required operations is approximately nmc. We say that the algorithm is “order nm” or “O(nm).”

How long does DP take? Say that your query is 200 amino acids long. You are searching the non-redundant database, which currently contains about a million proteins. If their average length is 200, then you have to fill in 200 × 200 × 1,000,000 = 4 × 10^10 DP entries. If it takes only 10 operations to fill in each cell, then you still have to do 4 × 10^11 floating point operations!
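The estimate on this slide can be checked with a few lines of Python. The function names are made up for this sketch; the numbers are the ones given on the slide.

```python
def dp_cells(query_len, avg_target_len, num_targets):
    """Number of DP matrix entries needed to align a query to every target."""
    return query_len * avg_target_len * num_targets

def dp_operations(query_len, avg_target_len, num_targets, ops_per_cell):
    """Total operations, assuming a constant cost per matrix entry."""
    return dp_cells(query_len, avg_target_len, num_targets) * ops_per_cell

cells = dp_cells(200, 200, 1_000_000)          # 4 x 10^10 entries
ops = dp_operations(200, 200, 1_000_000, 10)   # 4 x 10^11 operations
print(f"{cells:.0e} cells, {ops:.0e} operations")
```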

BLAST
DP is O(nm); BLAST is O(m). Fundamental innovation: employ a data structure to index the query sequence. The data structure allows you to look up entries in a table in O(1) time.
Does my length-n sequence contain the subsequence “GTR”?
– Naive method: scan the sequence, O(n).
– Improved method: hash table or search tree lookup, O(1).

Most-Cited Papers (columns: Rank, Paper, Citations)
1. P. Chomczynski, N. Sacchi, "Single-step method of RNA isolation by acid guanidinium thiocyanate phenol chloroform extraction," Analyt. Biochem., 162(1): 156-9, ,562
2. A.P. Feinberg, B. Vogelstein, "A technique for radiolabeling DNA restriction endonuclease fragments to high specific activity," Analyt. Biochem., 132(1): 6-13, ,609
3. S.F. Altschul, et al., "Basic Local Alignment Search Tool," J. Molec. Biol., 215(3): , ,306
4. G. Grynkiewicz, M. Poenie, R.Y. Tsien, "A new generation of Ca2+ indicators with greatly improved fluorescence properties," J. Biol. Chem., 260(6): , ,357
5. J. Devereux, P. Haeberli, O. Smithies, "A comprehensive set of sequence-analysis programs for the VAX," Nucleic Acids Res., 12(1): , ,056
SOURCE: Thomson ISI Web of Science
BLAST: the most cited paper of the 1990s

O'Reilly Book

BLAST algorithm
1. Remove low-complexity regions.
2. Make a list of all words of length 3 amino acids or 11 nucleotides.
3. Augment the list to include similar words.
4. Store the list in a search tree.
5. Scan the database for occurrences of the words in the search tree.
6. Connect nearby occurrences.
7. Extend the matches.
8. Prune the list of matches using a score threshold.
9. Evaluate the significance of each remaining match.
10. Perform Smith-Waterman to get an alignment.
See pp for details.

1: Sequence filtering
Low complexity sequences yield false positives. Therefore, replace these regions with Xs. The complexity of a window is computed from the window length, the alphabet size (4 or 20), and the frequency of the i-th letter.
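As a rough illustration of this filtering step, the sketch below masks low-complexity windows with Xs. It uses Shannon entropy over a sliding window as a stand-in complexity measure; this is an assumption made for the example and is not necessarily the formula from the slide or the filter BLAST itself applies. The window length and entropy cutoff are arbitrary illustrative values.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy (in bits) of the letter frequencies within a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every window whose entropy is below min_entropy with Xs.

    window and min_entropy are hypothetical defaults chosen for this sketch.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)

print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY" + "S" * 15))
```

A window made of a single repeated letter has entropy 0, while a window of 12 distinct letters has entropy log2(12), so the cutoff separates repeats from varied sequence.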

2: List all words in query
Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Query words: YGG GGF GFM FMT MTS TSE SEK …

3: Augment word list
Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Query words: YGG GGF GFM FMT MTS TSE SEK …
All possible words: AAA AAB AAC … YYY

3: Augment word list
BLOSUM62 scores:
GGF vs. AAA: 0 + 0 + (-2) = -2 (non-match)
GGF vs. GGY: 6 + 6 + 3 = 15 (match)
A user-specified threshold determines which three-letter words are considered matches and non-matches.

3: Augment word list
Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Query words: YGG GGF GFM FMT MTS TSE SEK …
Words similar to GGF: GGI GGL GGM GGF GGW GGY …
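The augmentation step can be sketched for the single query word GGF as follows. To keep the example self-contained, only the G and F rows of BLOSUM62 are copied in by hand, so the helper only works for query words made of G and F. The function names, the default threshold of 11, and the use of "score at or above the threshold" as the acceptance rule are assumptions made for this illustration.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Two rows of BLOSUM62, hand-copied for this sketch (columns in AMINO_ACIDS order).
BLOSUM62_ROWS = {
    "G": dict(zip(AMINO_ACIDS,
                  [0, -2, 0, -1, -3, -2, -2, 6, -2, -4,
                   -4, -2, -3, -3, -2, 0, -2, -2, -3, -3])),
    "F": dict(zip(AMINO_ACIDS,
                  [-2, -3, -3, -3, -2, -3, -3, -3, -1, 0,
                   0, -3, 0, 6, -4, -2, -2, 1, 3, -1])),
}

def word_score(query_word, candidate):
    """Sum of per-position substitution scores between two equal-length words."""
    return sum(BLOSUM62_ROWS[q][c] for q, c in zip(query_word, candidate))

def neighborhood(query_word, threshold=11):
    """All words scoring at least `threshold` against query_word."""
    return sorted(
        "".join(w)
        for w in product(AMINO_ACIDS, repeat=len(query_word))
        if word_score(query_word, w) >= threshold
    )

print(word_score("GGF", "AAA"), word_score("GGF", "GGY"))
print(neighborhood("GGF"))
```

Lowering the threshold enlarges the neighborhood, which increases sensitivity at the cost of more hits to examine.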

4: Store words in search tree
The augmented list of query words is stored in a search tree, which answers questions such as “Does this query contain GGF?” (“Yes, at position 2.”) in O(1) time.
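The lookup described here can be sketched with a Python dictionary, which is a hash table rather than a search tree but gives the same constant-time query. The function name is hypothetical, positions are 1-based to match the slide, and only the exact query words are indexed (the augmented list would add the similar words as extra keys).

```python
from collections import defaultdict

def build_word_index(query, word_len=3):
    """Map each word of length word_len to its 1-based start positions in query."""
    index = defaultdict(list)
    for start in range(len(query) - word_len + 1):
        index[query[start:start + word_len]].append(start + 1)
    return index

query = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"
index = build_word_index(query)

# "Does this query contain GGF?"
print("GGF" in index, index["GGF"])
```

Building the index costs one pass over the query; every later lookup is independent of the query length.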

Search tree (figure: the branch G, G fans out to the leaves L, M, F, W, Y)

5: Scan the database
(Figure: dot plot with the database sequence on one axis and the query sequence on the other; each word hit is marked with an x.)

6: Connect diagonal matches
(Figure: dot plot of word hits, marked x, for the database sequence against the query sequence.)
Two x’s are connected if and only if they are on the same diagonal and less than A letters apart.
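Steps 5 and 6 can be sketched together: scan the target for words that occur in the query index, then pair up hits that share a diagonal. Two hits lie on the same diagonal when the difference between target position and query position is equal. The function names, the 0-based positions, and the choice to measure distance between word start positions in the query are assumptions made for this example.

```python
from collections import defaultdict

def find_hits(query, target, word_len=3):
    """Return (query_pos, target_pos) for every exact word match, 0-based."""
    index = defaultdict(list)
    for q in range(len(query) - word_len + 1):
        index[query[q:q + word_len]].append(q)
    hits = []
    for t in range(len(target) - word_len + 1):
        for q in index.get(target[t:t + word_len], []):
            hits.append((q, t))
    return hits

def connect_diagonal_hits(hits, max_distance):
    """Pair consecutive hits on the same diagonal that are < max_distance apart."""
    by_diagonal = defaultdict(list)
    for q, t in hits:
        by_diagonal[t - q].append(q)  # same t - q means same diagonal
    pairs = []
    for diag, q_positions in by_diagonal.items():
        q_positions.sort()
        for first, second in zip(q_positions, q_positions[1:]):
            if second - first < max_distance:
                pairs.append(((first, first + diag), (second, second + diag)))
    return pairs

hits = find_hits("MKVLAAGIVGLK", "QQMKVPPAGIVQQ")
print(hits)
print(connect_diagonal_hits(hits, max_distance=6))
```

The scan visits each target word once, so its cost grows with the database size m and not with nm.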

7: Extend matches
Each match is extended to the left and right until a negative BLOSUM62 score is encountered.
Query sequence:    L P P Q G L L
Database sequence: M P P E G L L
BLOSUM62 scores: word score = , HSP score = 32
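The extension rule on this slide can be sketched as follows, using only the handful of BLOSUM62 entries needed for the example pair of sequences (copied by hand). The seed word PQG aligned to PEG is an assumption made for this illustration, since the slide's word score did not survive in the transcript; the function names are hypothetical. The stopping rule follows the slide's simplified description.

```python
# Hand-copied BLOSUM62 entries for the residue pairs in the slide's example only.
BLOSUM62_SUBSET = {
    ("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2, ("G", "G"): 6, ("L", "L"): 4,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def extend_hit(query, target, q_start, t_start, word_len=3):
    """Extend an ungapped word hit left and right while pair scores stay >= 0."""
    lq, lt = q_start, t_start
    rq, rt = q_start + word_len, t_start + word_len
    score = sum(pair_score(a, b) for a, b in zip(query[lq:rq], target[lt:rt]))
    while lq > 0 and lt > 0 and pair_score(query[lq - 1], target[lt - 1]) >= 0:
        lq, lt = lq - 1, lt - 1
        score += pair_score(query[lq], target[lt])
    while rq < len(query) and rt < len(target) and pair_score(query[rq], target[rt]) >= 0:
        score += pair_score(query[rq], target[rt])
        rq, rt = rq + 1, rt + 1
    return score, query[lq:rq], target[lt:rt]

# Assumed seed: PQG (query position 2) aligned with PEG (target position 2).
result = extend_hit("LPPQGLL", "MPPEGLL", q_start=2, t_start=2)
print(result)
```

The per-position scores 2 + 7 + 7 + 2 + 6 + 4 + 4 add up to the HSP score of 32 given on the slide.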

8: Prune matches Discard all matches that score below some threshold S.

9: Evaluate significance BLAST uses an analytical statistical significance calculation, which we will learn later.

10: Smith-Waterman Significant matches are re-analyzed using the Smith-Waterman dynamic programming algorithm. The alignments reported by BLAST are produced by dynamic programming.

BLAST parameters
– Filter sequences or not (yes)
– Word length (3)
– Substitution matrix (BLOSUM62)
– Word threshold (11)
– Diagonal distance (?)
– Match score threshold (?)
– Gap open and extension penalties (11, 1)

Summary (I) Dynamic programming is O(nm), where n is the length of the query and m is the size of the database. BLAST is O(m). BLAST produces an index of the query sequence that allows fast matching to the database.

Review: BLAST
(Figure sequence: a dot plot of the query sequence against the target sequence, built from the list of words in the query and similar words. Hits are marked x.)
– For each target word, ask: “Does this target word appear in the query word list?” “Yes, at position 34 in the query sequence.”
– Mark every such hit on the dot plot.
– These two hits are on the diagonal and close to each other, so let’s try to connect them.
– Assign an E-value to each hit.

BLAST “The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.” The initial word threshold T is the most important parameter. Low T = high sensitivity, long compute. High T = low sensitivity, quick compute.

Statistics Review
What is a distribution?
– A plot showing how frequently each value of a given variable or observation occurs.

Review
What is a null hypothesis?
– A statistician’s way of characterizing “chance.”
– Generally, a mathematical model of randomness with respect to a particular set of observations.
– The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.

Review
Examples of null hypotheses:
– Sequence comparison using shuffled sequences.
– A normal distribution of log ratios from a microarray experiment.
– LOD scores from genetic linkage analysis when the relevant loci are randomly sprinkled throughout the genome.

Empirical score distribution The picture shows a distribution of scores from a real database search using BLAST. This distribution contains scores from non-homologous and homologous pairs. High scores from homology.

Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database.

Review
What is a p-value?
– The probability of observing an effect as strong as or stronger than the one you observed, given the null hypothesis. I.e., “How likely is this effect to occur by chance?”
– Pr(x > S|null)

Review
If BLAST returns a score of 75, how would you compute the corresponding p-value?
– First, compute many BLAST scores using random queries and a random database.
– Summarize those scores into a distribution.
– Compute the area of the distribution to the right of the observed score (more details to come).
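The three steps above amount to an empirical p-value: the fraction of null scores that are at least as large as the observed one. The sketch below assumes the null scores have already been collected; the shuffling helper, the function names, and the ten null scores are invented purely for illustration.

```python
import random

def shuffle_sequence(seq, rng=random):
    """Return a randomly permuted copy of seq (one way to build a null sequence)."""
    return "".join(rng.sample(seq, len(seq)))

def empirical_p_value(observed, null_scores):
    """Fraction of null scores that are >= the observed score."""
    return sum(1 for s in null_scores if s >= observed) / len(null_scores)

# Made-up null scores standing in for searches against a randomized database.
null_scores = [20, 22, 25, 30, 31, 40, 45, 60, 76, 80]
p = empirical_p_value(75, null_scores)
print(p)
```

With only ten null scores the smallest non-zero p-value is 0.1, which is one reason an analytical distribution is fitted in practice.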

Review
What is the name of the distribution created by sequence similarity scores, and what does it look like?
– Extreme value distribution, or Gumbel distribution.
– It looks similar to a normal distribution, but it has a larger tail on the right.

Exponential distribution

Normal distribution

Extreme value distribution This distribution is characterized by a larger tail on the right.

Extreme value distribution
The distribution: $f(x) = e^{-x}\, e^{-e^{-x}}$
The area to the right of S: $\Pr(x > S) = 1 - \exp\left[-e^{-S}\right]$
Scaling to a particular type of score: $\Pr(x > S) = 1 - \exp\left[-e^{-\lambda (S - \mu)}\right]$, where μ is the mode and λ is a scale factor.
Compute the value of the distribution for x = 0. Solution: f(0) = exp[-1] = 0.368

An example You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = What is the p-value associated with 45?

Another example You run BLAST and get a score of 23. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 20 and λ = What is the p-value associated with 23?
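Both examples can be worked with a small helper, assuming the standard right-tail form of the extreme value distribution, Pr(x > S) = 1 - exp[-e^(-λ(S - μ))]. The λ values of the two examples did not survive in this transcript, so the value used below is a hypothetical placeholder, not the slide's number; the function name is also made up.

```python
import math

def evd_p_value(score, mu, lam):
    """Right-tail probability of a Gumbel distribution with mode mu and scale lam."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-math.exp(-lam * (score - mu)))

HYPOTHETICAL_LAMBDA = 0.5  # placeholder only; the slide's value is not known here

p_first = evd_p_value(45, mu=25, lam=HYPOTHETICAL_LAMBDA)
p_second = evd_p_value(23, mu=20, lam=HYPOTHETICAL_LAMBDA)
print(p_first, p_second)
```

A score far above the mode yields a tiny p-value, while a score only a few units above it does not, which is the contrast the two examples are built to show.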

What p-value is significant?
The most common thresholds are 0.01 and 0.05. A threshold of 0.05 means you are 95% sure that the result is significant. Is 95% enough? It depends upon the cost associated with making a mistake.
Examples of costs:
– Doing expensive wet lab validation.
– Making clinical treatment decisions.
– Misleading the scientific community.
Most sequence analysis uses more stringent thresholds because the p-values are not very accurate.

Multiple testing
Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05?
Pr(making a mistake) = 0.05
Pr(not making a mistake) = 0.95
Pr(not making any mistake) = 0.95^20 = 0.358
Pr(making at least one mistake) = 1 - 0.358 = 0.642
There is a 64.2% chance of making at least one mistake.

Bonferroni correction
Assume that individual tests are independent. (Is this a reasonable assumption?) Divide the desired p-value threshold by the number of tests performed. For the previous example, 0.05 / 20 = 0.0025.
Pr(making a mistake) = 0.0025
Pr(not making a mistake) = 0.9975
Pr(not making any mistake) = 0.9975^20 = 0.9512
Pr(making at least one mistake) = 1 - 0.9512 = 0.0488
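The effect of the correction can be checked numerically. The sketch below assumes independent tests, as the slide does, and compares the chance of at least one false positive with and without dividing the threshold by the number of tests. The function name is hypothetical.

```python
def prob_at_least_one_false_positive(threshold, num_tests):
    """Chance that at least one of num_tests independent null tests passes."""
    return 1.0 - (1.0 - threshold) ** num_tests

uncorrected = prob_at_least_one_false_positive(0.05, 20)
corrected = prob_at_least_one_false_positive(0.05 / 20, 20)
print(round(uncorrected, 3), round(corrected, 3))
```

After the correction the overall chance of a mistake comes back to just under the original 0.05 threshold.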

Database searching
Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?
Say that you want to use a conservative p-value of 0.001. Recall that you would observe such a p-value by chance approximately once every 1000 times in a random database. A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10^-9.

E-values
A p-value is the probability of making a mistake. The E-value is a version of the p-value that is corrected for multiple tests; it is essentially the converse of the Bonferroni correction. The E-value is computed by multiplying the p-value by the size of the database. The E-value is the expected number of times that the given score would appear in a random database of the given size. Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000.

E-value vs. Bonferroni
You observe a particular p-value p among n repetitions of a test. You want a significance threshold α.
– Bonferroni: Divide the significance threshold α by n: p < α/n.
– E-value: Multiply the p-value by n: pn < α.
* BLAST actually calculates E-values in a slightly more complex way.
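The two formulations lead to the same decision, which a short sketch makes concrete. It uses the simple "p-value times number of tests" definition from these slides, not BLAST's actual calculation, and the function names are made up for this example.

```python
def e_value(p_value, num_tests):
    """Simple E-value: p-value multiplied by the number of tests."""
    return p_value * num_tests

def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold after Bonferroni correction."""
    return alpha / num_tests

n = 1_000_000
alpha = 0.001
print(e_value(0.001, n))               # expected chance hits at p = 0.001
print(bonferroni_threshold(alpha, n))  # corrected per-test threshold

# Both rules accept or reject the same p-values.
for p in (2e-10, 5e-9):
    print(p, e_value(p, n) < alpha, p < bonferroni_threshold(alpha, n))
```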

Summary (II) Selecting a significance threshold requires evaluating the cost of making a mistake. Bonferroni correction: Divide the desired p-value threshold by the number of statistical tests performed. The E-value is the expected number of times that the given score would appear in a random database of the given size.

Position-specific iterated BLAST (figure: a cycle linking the query, BLAST, the sequence database, the homologs found, and a statistical model of the protein family)

PSI-BLAST pseudocode
Convert query to PSSM (position-specific scoring matrix)
do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
}
Print current set of homologs
Identifying new homologs requires a user-defined threshold.
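The loop in the pseudocode can be sketched in Python with the search and model-building steps passed in as functions. Everything named here (`psi_blast`, `search`, `build_model`, `max_iterations`) is hypothetical, and the toy "database" below is a stand-in that only demonstrates the control flow; it performs no real sequence comparison. The iteration cap is an addition for safety, not part of the slide's pseudocode.

```python
def psi_blast(query, search, build_model, max_iterations=5):
    """Iterate search and model update until no new homologs appear.

    search(model) returns the hits that pass the user-defined threshold;
    build_model(query, homologs) returns the updated model (the PSSM).
    """
    homologs = set()
    model = build_model(query, homologs)      # convert query to PSSM
    for _ in range(max_iterations):
        new = set(search(model)) - homologs   # BLAST database with PSSM
        if not new:                           # stop if no new homologs
            break
        homologs |= new                       # add new homologs to PSSM
        model = build_model(query, homologs)
    return homologs

# Toy stand-in: each sequence "detects" the sequences listed next to it.
SIMILAR = {"q": {"a"}, "a": {"b"}, "b": set(), "c": {"d"}}

def toy_build_model(query, homologs):
    return frozenset({query} | homologs)

def toy_search(model):
    return set().union(*(SIMILAR[m] for m in model))

found = psi_blast("q", toy_search, toy_build_model)
print(sorted(found))
```

In the toy run, "b" is only reachable through "a", which mirrors how each iteration can uncover homologs that the original query alone would miss.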

Summary (III) PSI-BLAST iterates BLAST, adding new homologs at each iteration.