Random Walks and BLAST Marek Kimmel (Statistics, Rice) 713 348 5255



Outline: Explaining the connection; Simple RW with absorption; Moment-generating function method; Size and duration of excursions; Renewal equation and general RW; Significance of alignments in BLAST

Intuitive introduction: alignment as a random walk
ggagactgtagac
gaacgccctagcc
Scores: match = 1, mismatch = -1. Solid symbols = ladder points, squares = excursions.
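The picture on this slide can be reproduced numerically. The sketch below assumes that the 26 letters on the slide are two aligned 13-letter sequences, that a ladder point is a position where the walk reaches a new minimum (the start counts as one), and that an excursion height is the largest rise above a ladder point before the next one. The function names are illustrative and not taken from the slides.

```python
def alignment_walk(seq1, seq2, match=1, mismatch=-1):
    """Cumulative alignment score after each position; the walk starts at 0."""
    walk = [0]
    for a, b in zip(seq1, seq2):
        walk.append(walk[-1] + (match if a == b else mismatch))
    return walk


def ladder_points_and_excursions(walk):
    """Return (ladder point positions, excursion heights).

    Assumed definitions: a ladder point is a new strict minimum of the walk;
    an excursion height is the maximum rise above the current ladder point
    before the next ladder point (or the end of the walk).
    """
    ladders = [0]
    heights = []
    low = peak = walk[0]
    for i, s in enumerate(walk[1:], start=1):
        if s < low:                      # new minimum: the excursion ends here
            heights.append(peak - low)
            ladders.append(i)
            low = peak = s
        else:
            peak = max(peak, s)
    heights.append(peak - low)           # last, possibly unfinished, excursion
    return ladders, heights


walk = alignment_walk("ggagactgtagac", "gaacgccctagcc")
ladders, heights = ladder_points_and_excursions(walk)
print(walk)
print(ladders, heights)
```

The largest excursion height is the quantity whose chance distribution the rest of the lecture is about.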

Relation to BLAST. The quality of an alignment is reflected by the course of the RW. The distribution of the maximum excursion heights achievable by chance provides the null hypothesis.

Simple RW with absorbing boundaries. We consider the case p ≠ q only.

Absorption probabilities. Consider the backward equation: a 2nd-order, homogeneous, linear difference equation.

Absorption probabilities This provides Constants derived from boundary conditions
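The equations on the two absorption-probability slides are not preserved in the transcript. A standard formulation is sketched below under assumed notation that is not taken from the slides: the walk starts at h between absorbing boundaries a < h < b, steps +1 with probability p and -1 with probability q = 1 - p, and w_h is the probability of absorption at b.

```latex
\begin{align*}
w_h &= p\,w_{h+1} + q\,w_{h-1}, \qquad w_a = 0,\quad w_b = 1
      && \text{(backward equation and boundary conditions)}\\
w_h &= C_1 + C_2\,e^{\theta^* h}, \qquad \theta^* = \log(q/p)
      && \text{(general solution, } p \neq q)\\
w_h &= \frac{e^{\theta^* h} - e^{\theta^* a}}{e^{\theta^* b} - e^{\theta^* a}}
      && \text{(constants fixed by the boundary conditions)}
\end{align*}
```

Both 1 and e^{θ* h} = (q/p)^h solve the homogeneous equation, which is why the general solution has this two-constant form.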

Mean number of steps to absorption. 2nd-order, inhomogeneous difference equation (*). Solution = any particular solution of (*) + general solution of the corresponding homogeneous equation. A particular solution is verified by substitution, and the full solution follows.
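The "verify" step can be checked numerically. The sketch below assumes, since the slide's equations are not preserved, that (*) is m_h = 1 + p m_{h+1} + q m_{h-1} with m_a = m_b = 0, that h/(q - p) is the particular solution, and that w_h = (r^h - r^a)/(r^b - r^a) with r = q/p is the probability of absorption at the upper boundary b. All names are illustrative.

```python
def absorb_prob_upper(h, a, b, p):
    """Probability of absorption at b, starting from h (a <= h <= b, p != 1/2)."""
    r = (1.0 - p) / p                    # r = q/p = exp(theta*)
    return (r ** h - r ** a) / (r ** b - r ** a)


def mean_steps(h, a, b, p):
    """Mean number of steps to absorption: particular solution h/(q-p)
    plus a homogeneous part chosen so that m_a = m_b = 0."""
    q = 1.0 - p
    return ((h - a) - (b - a) * absorb_prob_upper(h, a, b, p)) / (q - p)


def max_residuals(a, b, p):
    """Largest violation of the two difference equations over interior points."""
    q = 1.0 - p
    res_w = res_m = 0.0
    for h in range(a + 1, b):
        w_rhs = p * absorb_prob_upper(h + 1, a, b, p) + q * absorb_prob_upper(h - 1, a, b, p)
        m_rhs = 1.0 + p * mean_steps(h + 1, a, b, p) + q * mean_steps(h - 1, a, b, p)
        res_w = max(res_w, abs(absorb_prob_upper(h, a, b, p) - w_rhs))
        res_m = max(res_m, abs(mean_steps(h, a, b, p) - m_rhs))
    return res_w, res_m


print(max_residuals(a=-1, b=6, p=0.3))   # both residuals should be at rounding level
```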

Moment-generating function approach

Simple RW: Until absorption

Moment-generating function approach Sticky argument now: At the time of absorption, But the latter is equal to 1

Stopping time (at absorption)
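The formulas of the moment-generating function slides are not preserved in the transcript. One standard way to carry out the argument uses Wald's identity. This is a sketch under assumed notation, which may differ from the slides: S is a single step, N the absorption time, T_N the total displacement at absorption, and a < h < b the boundaries and starting point.

```latex
\begin{align*}
m(\theta) &= E\!\left[e^{\theta S}\right] = p\,e^{\theta} + q\,e^{-\theta},
  \qquad m(\theta^*) = 1 \ \text{ for } \ \theta^* = \log(q/p) \neq 0,\\
1 &= E\!\left[m(\theta)^{-N} e^{\theta T_N}\right]
  \quad\Longrightarrow\quad
  E\!\left[e^{\theta^* T_N}\right] = 1,\\
1 &= w_h\,e^{\theta^*(b-h)} + (1-w_h)\,e^{\theta^*(a-h)}
  \quad\Longrightarrow\quad
  w_h = \frac{e^{\theta^* h} - e^{\theta^* a}}{e^{\theta^* b} - e^{\theta^* a}},\\
E[T_N] &= E[S]\,E[N]
  \quad\Longrightarrow\quad
  E[N] = \frac{w_h\,(b-h) + (1-w_h)\,(a-h)}{p-q}.
\end{align*}
```

The last line follows from differentiating Wald's identity with respect to θ at θ = 0. It agrees with the solution obtained from the difference-equation route.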

Asymptotics (p < q) Hypotheses So, define Y = excursion height

Asymptotics of the mean time to absorption A = Mean{# steps before absorption at -1} Since we have
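For the simple walk both asymptotic constants can be written explicitly, which makes the claims easy to check. The sketch assumes the walk starts at 0 with a lower absorbing boundary at -1 and an upper boundary at y, so that Pr[Y ≥ y] is the probability of reaching y before -1. Under that assumption C = 1 - e^{-θ*} and A = 1/(q - p), neither of which is legible on the slides.

```python
import math


def excursion_tail_exact(y, p):
    """Pr[Y >= y]: probability of reaching y before -1, starting at 0 (p < q)."""
    r = (1.0 - p) / p                     # r = q/p = exp(theta*)
    return (1.0 - 1.0 / r) / (r ** y - 1.0 / r)


def excursion_tail_asymptotic(y, p):
    """C * exp(-theta* y) with C = 1 - exp(-theta*), for the simple walk."""
    theta_star = math.log((1.0 - p) / p)
    c = 1.0 - math.exp(-theta_star)
    return c * math.exp(-theta_star * y)


def mean_ladder_distance(p):
    """A: mean number of steps until first absorption at -1 (no upper boundary)."""
    return 1.0 / ((1.0 - p) - p)


p = 0.25                                  # assumed value: chance of a DNA match
for y in (1, 5, 10, 20):
    print(y, excursion_tail_exact(y, p) / excursion_tail_asymptotic(y, p))
print(mean_ladder_distance(p))
```

The ratio of the exact to the asymptotic tail approaches 1 as y grows, which is the sense in which Y is geometric-like.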

Random walks versus alignments

Anatomy of an excursion. Pr[Y_i ≥ y] ~ C exp(-θ* y). A = E[inter-ladder pts. distance]. A and C are difficult to compute.

P-values for a BLAST comparison Assume comparison of two sequences of length N, with expected ladder points distance A. This gives n=N/A excursions on the average. Also, let us denote From expression (2.134) we have (since Y is geometric-like) Making substitutions we obtain

P-values for a BLAST comparison From previous slide Let us assume a normalized score Substituting into previous inequality, we obtain So, P-value, corresponding to an empirically obtained maximum score, equals

P-values for a BLAST comparison Expected value of the normalized score is equal approximately to Euler’s constant This yields Both and are invariant with respect to multiplication of the score by a constant (why?)

P-values for a BLAST comparison Expected number of excursions of height at least equal to v For an empirically found value of the score, By comparison with a previous formula we see
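The formulas on the four P-value slides are not preserved. The chain of substitutions they describe can be sketched in code under the following assumptions: the maximum excursion height satisfies Pr[Y_max ≥ y] ≈ 1 - exp(-N K e^{-θ* y}), where K collects C/A and a lattice correction; the normalized score is S' = θ* Y_max - ln(N K); and the expected number of excursions of height at least v is E = N K e^{-θ* v}. The numerical values of θ*, K and N below are made up for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329          # mean of the limiting distribution of S'


def normalized_score(y_max, theta_star, k, n):
    """S' = theta* * y_max - ln(N K)."""
    return theta_star * y_max - math.log(n * k)


def p_value_from_normalized(s):
    """P-value ~ 1 - exp(-exp(-s)) for an observed normalized score s."""
    return -math.expm1(-math.exp(-s))


def expected_excursions(v, theta_star, k, n):
    """E: expected number of excursions of height at least v."""
    return n * k * math.exp(-theta_star * v)


theta_star, k, n = math.log(3.0), 0.1, 1000     # hypothetical parameters
y_max = 9
s = normalized_score(y_max, theta_star, k, n)
e = expected_excursions(y_max, theta_star, k, n)
print(s, e, p_value_from_normalized(s))
```

Because e^{-S'} equals E, the P-value can equally be written as 1 - e^{-E}, which is presumably the comparison the last slide has in mind. Multiplying all scores by a constant c divides θ* by c and leaves θ* Y_max unchanged.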

From the BLAST course To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of (i) real but non-homologous sequences; (ii) real sequences that are shuffled to preserve compositional properties or (iii) sequences that are generated randomly based upon a DNA or protein sequence model. Analytic statistical results invariably use the last of these definitions of chance, while empirical results based on simulation and curve-fitting may use any of the definitions.

From the BLAST course As demonstrated above, scores of local alignments are covered by a well-developed theory. For global alignments, Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions, but these cannot be generalized easily.
– It is possible to express the score of interest in terms of standard deviations from the mean, but it is a mistake to assume that the relevant distribution is normal and convert this Z-value into a P-value; the tail behavior of global alignment scores is unknown.
– The most one can say reliably is that if 100 random alignments have score inferior to the alignment of interest, the P-value in question is likely less than 0.01.
One further pitfall to avoid is exaggerating the significance of a result found among multiple tests.
– When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly.
– An alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.
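The discounting for multiple tests is a one-line calculation. The sketch below assumes independent trials and uses a single-trial P-value of 0.0001, chosen because it reproduces the figure of roughly 0.1 for the best of 1000 trials. The function name is illustrative.

```python
def best_of_k_p_value(p_single, k):
    """P-value of the best result among k independent trials."""
    return 1.0 - (1.0 - p_single) ** k


p_single, k = 1e-4, 1000
print(best_of_k_p_value(p_single, k))     # close to the simple bound k * p_single = 0.1
```

For small k * p_single the exact value and the product k * p_single nearly coincide, so multiplying by the number of trials is a serviceable rule of thumb.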

From the BLAST course The pairwise E-value applies to the comparison of two proteins of lengths m and n. How does one assess the significance of an alignment that arises from the comparison of a protein of length m to a database containing many different proteins, of varying lengths? One view is that all proteins in the database are a priori equally likely to be related to the query. This implies that a low E-value for an alignment involving a short database sequence should carry the same weight as a low E-value for an alignment involving a long database sequence. To calculate a "database search" E-value, one simply multiplies the pairwise-comparison E-value by the number of sequences in the database.

From the BLAST course An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains. If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues. Examining the E-value equation, this can be accomplished simply by treating the database as a single long sequence of length N. The BLAST programs take this approach to calculating database E-values. Notice that for DNA sequence comparisons, the length of database records is largely arbitrary, and therefore this is the only really tenable method for estimating statistical significance.
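The two views of the database correction can be compared directly. The sketch assumes the usual pairwise form E = K m n e^{-λS}, which is not reproduced in this transcript. The values of K and λ, the lengths and the function names are hypothetical.

```python
import math


def pairwise_e_value(score, m, n, k, lam):
    """Assumed pairwise form: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def database_e_equal_prior(score, m, n, k, lam, num_sequences):
    """View 1: every database sequence is equally likely to be related."""
    return pairwise_e_value(score, m, n, k, lam) * num_sequences


def database_e_length_prior(score, m, n, k, lam, total_length):
    """View 2: prior chance of relatedness proportional to length (factor N/n)."""
    return pairwise_e_value(score, m, n, k, lam) * total_length / n


k, lam = 0.1, 0.3                          # hypothetical Karlin-Altschul parameters
m, n, score = 250, 400, 60
total_length, num_sequences = 5_000_000, 15_000
print(database_e_equal_prior(score, m, n, k, lam, num_sequences))
print(database_e_length_prior(score, m, n, k, lam, total_length))
# View 2 is the same as one pairwise comparison against a sequence of length N.
print(pairwise_e_value(score, m, total_length, k, lam))
```

Under the second view the record length n cancels, which is why it remains usable when database record boundaries are arbitrary.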

Comparison of two unaligned sequences. Until now: a fixed ungapped alignment in the comparison of two sequences of length N each. Now: two sequences of lengths N_1 and N_2 without any specific alignment (N_1 + N_2 – 1 ungapped alignments in total). The theory is advanced; we give only highlights of the results. Many conclusions of the previous sections carry over with N replaced by N_1 N_2.

Scores The basic score is re-defined now Mean number of (independent) ladder points in all alignments Since the heights of excursions are geometric-like rv’s (n of them),

Scores From the previous slide Define standardized score Expected count of (independent) excursions of height at least y Similar expressions as before for expected score and P-value
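The expressions on the two "Scores" slides are not preserved. Following the stated rule that N is replaced by N_1 N_2, they would plausibly take the form below. This is a reconstruction under assumptions rather than a transcription: K stands for the constant that collects C/A and a lattice correction, θ* for the decay rate of the excursion-height tail, and Y_max for the maximum excursion height over all alignments.

```latex
\begin{align*}
n &\approx \frac{N_1 N_2}{A}
  && \text{(mean number of independent ladder points)}\\
\Pr[Y_{\max} \ge y] &\approx 1 - \exp\!\left(-K N_1 N_2\, e^{-\theta^* y}\right)
  && \text{(maximum of } n \text{ geometric-like heights)}\\
S' &= \theta^*\, Y_{\max} - \log(N_1 N_2 K)
  && \text{(standardized score)}\\
E &= N_1 N_2 K\, e^{-\theta^* y}
  && \text{(expected count of excursions of height} \ge y)\\
\text{P-value} &\approx 1 - \exp\!\left(-e^{-s}\right) = 1 - e^{-E}
  && \text{(for an observed standardized score } s)
\end{align*}
```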

Karlin-Altschul sum statistic Idea: Add information from the r-1 “next to the highest” excursions It was proved that The particular statistics used
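The result referred to by "It was proved that" is not preserved in the transcript. A commonly quoted form of the Karlin-Altschul tail approximation for the sum T_r of the r largest normalized scores is Pr[T_r ≥ t] ≈ e^{-t} t^{r-1} / (r! (r-1)!) for large t. The sketch assumes that form; the function name is illustrative.

```python
import math


def sum_statistic_tail(t, r):
    """Approximate Pr[T_r >= t] for the sum of the r highest normalized scores.

    Assumed form: exp(-t) * t**(r-1) / (r! * (r-1)!), valid for large t.
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    return math.exp(-t) * t ** (r - 1) / (math.factorial(r) * math.factorial(r - 1))


for r in (1, 2, 3):
    print(r, sum_statistic_tail(10.0, r))
```

For r = 1 the expression reduces to e^{-t}, the large-t tail of the single-score P-value 1 - exp(-e^{-t}), so the sum statistic extends the single-excursion test rather than replacing it.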

Choice of r and multiple testing. Usually, sum tests are performed for all “available” r. The best P-value is accepted, following heuristic corrections (see Section 9.3.4).

Comparison of a query sequence against a database. Use the Poisson distribution to obtain the following probability. Since the database is of length D, the expected # HSPs with scores ≥ v For all other Analyze Example
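The Poisson step can be made concrete. The sketch assumes that the number of HSPs with score at least v is approximately Poisson with a mean equal to the expected count E. How E scales with the database length D is left as an input, because the slide's formula is not preserved. The function name is illustrative.

```python
import math


def prob_at_least_r_hsps(expected_count, r=1):
    """Poisson probability of observing r or more HSPs when the mean is E."""
    below = sum(expected_count ** i / math.factorial(i) for i in range(r))
    return 1.0 - math.exp(-expected_count) * below


print(prob_at_least_r_hsps(0.5))          # at least one HSP: 1 - exp(-E)
print(prob_at_least_r_hsps(1.0, r=2))     # at least two HSPs
```

With r = 1 this is the familiar conversion from an E-value to a P-value, P = 1 - e^{-E}.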