Comp. Genomics Recitation 2 (week 3) 19/3/09

Outline Finding repeats Branch & Bound for MSA Multiple hypotheses testing

Exercise: Finding repeats Basic objective: find a pair of subsequences within S with maximum similarity. Simple (albeit wrong) idea: find an optimal alignment of S with itself! (Why is this wrong? The best self-alignment is the trivial one that matches every character to itself.) But using local alignment is still a good idea.

Variant #1 First specification: the two subsequences may overlap. Solution: change the local alignment algorithm: compute only the upper triangular submatrix (V(i,j), where j>i) and set the diagonal values to 0. Complexity: O(n^2) time and O(n) space.
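A minimal sketch of this variant (the scoring parameters match=2, mismatch=-1, gap=-1 are illustrative, not from the slides): the local-alignment recurrence restricted to the upper triangle, with only two rows kept for O(n) space. Cells on and below the diagonal stay 0, which excludes the trivial self-alignment.

```python
def repeat_self_alignment(s, match=2, mismatch=-1, gap=-1):
    """Best local alignment of s against itself over cells with j > i."""
    n = len(s)
    best = 0
    prev = [0] * (n + 1)          # row i-1 of V
    for i in range(1, n + 1):
        cur = [0] * (n + 1)       # cells with j <= i stay 0 (diagonal zeroed)
        for j in range(i + 1, n + 1):
            sub = match if s[i - 1] == s[j - 1] else mismatch
            cur[j] = max(0,
                         prev[j - 1] + sub,   # align s[i] with s[j]
                         prev[j] + gap,       # gap opposite s[i]
                         cur[j - 1] + gap)    # gap opposite s[j]
            best = max(best, cur[j])
        prev = cur
    return best
```

For example, "xABCyABCz" contains the repeat "ABC", which scores 3 matches under these parameters.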

Variant #2 Second specification: the two subsequences may not overlap. Solution: absence of overlap means that some k exists such that one subsequence lies in S[1..k] and the other in S[k+1..n]. Compute local alignments between S[1..k] and S[k+1..n] for all 1<=k<n and pick the highest-scoring alignment. Complexity: O(n^3) time and O(n) space.
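The split-point idea can be sketched as follows (function names and scoring parameters are illustrative; the inner routine is a standard Smith-Waterman score in linear space):

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Standard Smith-Waterman score of a vs. b in O(len(b)) space."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def best_nonoverlapping_repeat(s):
    """Try every split point k and locally align the two halves."""
    return max(local_alignment_score(s[:k], s[k:]) for k in range(1, len(s)))
```

Each of the n-1 splits costs O(n^2) time, giving the O(n^3) total stated above.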

Second Variant, Pictorially

Third Variant Specification: the two subsequences must be consecutive (a tandem repeat). Solution: similar to variant #2, but somewhat "ends-free": seek a global alignment between S[1..k] and S[k+1..n], with no penalties for gaps at the beginning of S[1..k] and no penalties for gaps at the end of S[k+1..n]. Complexity: O(n^3) time and O(n) space.

Variant #3

Branch and Bound algorithms An exact technique for hard problems. Idea: compute some rough bound on the score, start exploring the complete solution space, and do not explore "regions" that provably score worse than the bound. Maybe: improve the bound as you go along. Guarantees an optimal solution, but does not guarantee the runtime. A good bound is essential for good performance.

Slightly more formal Branch: enumerate all possible next steps from current partial solution Bound: if a partial solution violates some constraint, e.g., an upper bound on cost, drop/prune the branch (don’t follow it further) Backtracking: once a branch is pruned, move back to the previous partial solution and try another branch (depth-first branch-and-bound)

Example: TSP Traveling salesperson problem: Input: a complete directed graph with weights on the edges. Tour: a directed cycle visiting every node exactly once. Problem: find a tour of minimum cost. NP-hard (reduction from directed Hamiltonian cycle). Difficult to approximate. A good example of enumeration.
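A minimal depth-first branch-and-bound sketch for TSP (not from the slides; the bound is deliberately crude: every remaining city costs at least the cheapest edge in the graph, so any partial tour that already exceeds the best known tour even under this optimistic estimate is pruned):

```python
import math

def tsp_branch_and_bound(dist):
    """Exact minimum tour cost for a complete weighted graph (matrix dist)."""
    n = len(dist)
    best = [math.inf]
    cheapest = min(dist[i][j] for i in range(n) for j in range(n) if i != j)

    def extend(city, visited, cost):
        remaining = n - len(visited)
        # Bound: each unvisited city needs at least one edge >= cheapest.
        if cost + remaining * cheapest >= best[0]:
            return  # prune: this branch cannot beat the best known tour
        if remaining == 0:
            best[0] = min(best[0], cost + dist[city][0])  # close the tour
            return
        for nxt in range(n):  # branch: enumerate all possible next cities
            if nxt not in visited:
                extend(nxt, visited | {nxt}, cost + dist[city][nxt])

    extend(0, {0}, 0)
    return best[0]
```

A tighter bound (e.g., a minimum spanning tree of the unvisited cities) would prune far more branches; the guarantee of optimality is unaffected either way.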

Example: TSP “Find me a tour of 10 cities in 2500 miles”

Reminder: The hypercube Dynamic programming extension Exponential in the number of sequences Source: Ralf Zimmer slides

B&B for MSA Carrillo and Lipman (1988); see the scribe notes for full details. The bound: every multiple alignment induces k(k-1)/2 pairwise alignments, and none of them scores better than the corresponding optimal pairwise alignment. (Why?) The branching: compute the complete hypercube, ignoring regions that exceed the bound.

B&B for MSA

(Some) gory details Requires a "forward" recurrence: once cell D(i,j,k) is computed, it is "transmitted forward" to the next cells. The cube cells are stored in a queue, and the next cell to compute is picked from the head of the queue. It is (easy) to show that once a cell reaches the head of the queue, it has received "transmissions" from all its relevant neighbors.

(Some) gory details d_{a,b}(i,j) is the optimal distance between the suffixes S_a[i..n] and S_b[j..n]. (Easy to compute?) Assume we found some MSA with score z (e.g., by an iterative alignment). Key lemma: if D(i,j,k) + d_{1,2}(i,j) + d_{1,3}(i,k) + d_{2,3}(j,k) > z, then D(i,j,k) is not on any optimal path, and thus we need not send it forward.

(Some) gory details In practice, Carrillo-Lipman can align 6 sequences of about 200 characters in a "practical" amount of time. It is probably impractical for larger numbers of sequences. Is optimality that important?

Review What is a null hypothesis? A statistician’s way of characterizing “chance.” Generally, a mathematical model of randomness with respect to a particular set of observations. The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.

Empirical score distribution The picture shows a distribution of scores from a real database search using BLAST. This distribution contains scores from both non-homologous and homologous pairs; the high scores come from the homologous pairs.

Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database.

Review What is a p-value? The probability of observing an effect as strong or stronger than you observed, given the null hypothesis. I.e., "How likely is this effect to occur by chance?" Pr(x >= S | null).

Review If BLAST returns a score of 75, how would you compute the corresponding p-value? First, compute many BLAST scores using random queries and a random database. Summarize those scores into a distribution. Compute the area of the distribution to the right of the observed score (more details to come).
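The recipe above (collect scores from random searches, then take the area of the distribution to the right of the observed score) can be sketched as follows; the add-one smoothing is a common convention, not something the slides specify:

```python
def empirical_pvalue(observed, null_scores):
    """Fraction of null scores at or above the observed score.

    Add-one smoothing keeps the estimate strictly positive even when
    no random score reaches the observed one."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)
```

The resolution of this estimate is limited by the number of random scores collected: with 1,000 null scores you cannot estimate p-values much below 0.001.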

Review What is the name of the distribution created by sequence similarity scores, and what does it look like? Extreme value distribution, or Gumbel distribution. It looks similar to a normal distribution, but it has a larger tail on the right.

Extreme value distribution The distribution of the maximum of a series of independently and identically distributed (i.i.d.) variables. Actually a family of distributions (Fréchet, Weibull and Gumbel), parameterized by shape, location and scale.
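For the Gumbel case the right tail has a closed form, so a p-value can be computed directly once the location (mu) and scale (lam is the inverse scale) have been fitted; the parameter names here are illustrative:

```python
import math

def gumbel_pvalue(score, mu, lam):
    """P(S >= score) under a Gumbel null with location mu and scale 1/lam.

    Uses the Gumbel CDF F(x) = exp(-exp(-lam * (x - mu))),
    so the right-tail probability is 1 - F(score)."""
    return 1.0 - math.exp(-math.exp(-lam * (score - mu)))
```

At the location parameter itself the tail probability is 1 - exp(-1), about 0.63, which reflects the heavy right tail relative to a normal distribution.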

What p-value is significant? The most common thresholds are 0.05 and 0.01. A threshold of 0.05 means accepting a 5% chance of calling a chance result significant. Is 95% confidence enough? It depends upon the cost associated with making a mistake. Examples of costs: doing expensive wet lab validation, making clinical treatment decisions, misleading the scientific community. Most sequence analysis uses more stringent thresholds because the p-values are not very accurate.

Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05? Pr(making a mistake) = 0.05. Pr(not making a mistake) = 0.95. Pr(not making any mistake) = 0.95^20 = 0.358. Pr(making at least one mistake) = 1 - 0.358 = 0.642. There is a 64.2% chance of making at least one mistake.
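The arithmetic above generalizes to any threshold and any number of independent tests:

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance of at least one false positive among n_tests independent
    tests, each performed at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests
```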

Bonferroni correction Assume that individual tests are independent. (Is this a reasonable assumption?) Divide the desired p-value threshold by the number of tests performed. For the previous example, 0.05 / 20 = 0.0025. Pr(making a mistake) = 0.0025. Pr(not making a mistake) = 0.9975. Pr(not making any mistake) = 0.9975^20 = 0.951. Pr(making at least one mistake) = 1 - 0.951 = 0.049.
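The correction itself is a one-liner, and the residual family-wise error can be checked directly:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold keeping the family-wise error rate
    at or below alpha (assuming independent tests)."""
    return alpha / n_tests
```

With 0.05 / 20 = 0.0025 per test, the chance of at least one false positive drops back to roughly 0.049, just under the desired 0.05.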

Database searching Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use? Say that you want to use a conservative p-value of 0.001. Recall that you would observe such a p-value by chance approximately once per 1,000 searches of a random database. A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 10^-9.

E-values A p-value is the probability that a score this good arises by chance in a single comparison. The E-value is a version of the p-value that is corrected for multiple tests; it is essentially the converse of the Bonferroni correction. The E-value is computed by multiplying the p-value by the size of the database: it is the expected number of times that the given score would appear in a random database of the given size. Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000.
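The conversion is a simple multiplication (function name illustrative):

```python
def evalue(pvalue, db_size):
    """Expected number of hits at least this good in a random
    database of db_size sequences."""
    return pvalue * db_size
```

Note the consequence: a Bonferroni-corrected threshold of 10^-9 on the p-value corresponds to an E-value threshold of 10^-3.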

E-value vs. Bonferroni You observe, among n repetitions of a test, a particular p-value p, and you want a significance threshold α. Bonferroni: divide the threshold by n, i.e., require p < α/n. E-value: multiply the p-value by n, i.e., require pn < α. The two criteria are equivalent. * BLAST actually calculates E-values in a slightly more complex way.

False discovery rate The false discovery rate (FDR) is the proportion of examples above a given position in the ranked list that are expected to be false positives. Example: above the chosen cutoff there are 5 FP and 13 TP; below it, 33 TN and 5 FN. FDR = FP / (FP + TP) = 5/18 = 27.8%.
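The worked example above, as code:

```python
def false_discovery_rate(fp, tp):
    """Proportion of false positives among examples called positive."""
    return fp / (fp + tp)
```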

Bonferroni vs. FDR Bonferroni controls the family-wise error rate, i.e., the probability of at least one false positive. FDR is the proportion of false positives among the examples that are flagged as positive.

Controlling the FDR Order the unadjusted p-values p1 <= p2 <= ... <= pm. To control the FDR at level α, find the largest j* such that p_j* <= (j*/m)α, and reject the null hypothesis for j = 1, …, j*. This approach is conservative if many examples are truly non-null. (Benjamini & Hochberg, 1995)
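The Benjamini-Hochberg step-up procedure can be sketched as follows (function name illustrative); it returns the indices of the rejected hypotheses:

```python
def benjamini_hochberg(pvalues, alpha):
    """Indices of hypotheses rejected by the BH step-up procedure:
    find the largest rank j with p_(j) <= (j/m) * alpha and reject
    the j smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ranks 1..m
    j_star = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            j_star = rank  # keep the largest qualifying rank
    return sorted(order[:j_star])
```

Note the step-up character: a p-value that exceeds its own threshold can still be rejected if some larger p-value meets its threshold, which is why the loop keeps the largest qualifying rank rather than stopping at the first failure.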

Q-value software

Significance Summary Selecting a significance threshold requires evaluating the cost of making a mistake. Bonferroni correction: Divide the desired p-value threshold by the number of statistical tests performed. The E-value is the expected number of times that the given score would appear in a random database of the given size.