Presented By Cindy Xiaotong Lin


Random Walks. Presented by Cindy Xiaotong Lin

Why Random Walks? A random walk (RW) is a useful model for understanding stochastic processes across a variety of scientific disciplines. Random walk theory supplies the basic probability theory behind BLAST (the most widely used sequence alignment method).

What is a Random Walk? An intuitive understanding: a series of movements whose direction and size are randomly decided (e.g., the path a drunk person leaves behind). Formal definition: Let $x_0$ be a fixed vector in the d-dimensional Euclidean space $\mathbb{R}^d$ and $\{X_n, n \ge 1\}$ a sequence of independent, identically distributed (i.i.d.) random variables taking values in $\mathbb{R}^d$. The discrete-time stochastic process $\{S_n, n \ge 0\}$ defined by $S_0 = x_0$, $S_n = x_0 + X_1 + \cdots + X_n$ is called a d-dimensional random walk.

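Definitions (cont.) If the starting point $x_0$ is in $\mathbb{Z}^d$ and the RVs $X_n$ take values in $\mathbb{Z}^d$, then the process is called a d-dimensional lattice random walk. In the lattice walk case, if we only allow jumps from $x$ to $x + e_k$ or $x - e_k$, where $e_k$ is one of the $d$ unit coordinate vectors, then the process is called a d-dimensional simple random walk.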

Definitions (cont.) A random walk is called a restricted walk if it is limited to the interval [a, b]. The endpoints a and b are called absorbing barriers if the walk stays there forever once it reaches them, or reflecting barriers if the walk bounces back when it reaches them.

Example: sequence alignment modeled as RW

ggagactgtagacagctaatgctata
| |  |  ||| ||        |||
gaacgccctagccacgagcccttatc

Simple scoring scheme, at each position: +1 for identical nucleotides, -1 for different nucleotides.
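The +1/-1 scoring scheme turns the alignment into a walk of accumulated scores. The sketch below illustrates this for the two sequences on this slide; the function name `score_walk` and the convention that the walk starts at 0 before the first position are assumptions made for the example.

```python
def score_walk(seq1, seq2, match=1, mismatch=-1):
    """Accumulated alignment score after each position, starting from 0."""
    if len(seq1) != len(seq2):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    walk = [0]
    for x, y in zip(seq1, seq2):
        step = match if x == y else mismatch
        walk.append(walk[-1] + step)
    return walk


seq1 = "ggagactgtagacagctaatgctata"
seq2 = "gaacgccctagccacgagcccttatc"
walk = score_walk(seq1, seq2)
print(walk)
```

Each entry of `walk` is the score accumulated up to that position, so the list is one realization of a simple random walk with steps +1 and -1.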

Example (cont.): simple RW. Ladder point (LP): a point in the walk lower than any previously reached point. Excursion: the part of the walk from a LP until the highest point attained before the next LP. Excursion heights in the figure: 1, 1, 4, 0, 0, 0, 3. BLAST theory focuses on the maximum heights achieved by these excursions.
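The ladder-point and excursion definitions can be sketched in a few lines. This is an illustration only: it assumes the walk starts at 0, that the starting point counts as the first ladder point, and that the part of the walk after the last ladder point counts as a final excursion. The name `excursion_heights` is hypothetical.

```python
def excursion_heights(walk):
    """Height of each excursion above the ladder point it starts from."""
    heights = []
    ladder = walk[0]   # current ladder point (lowest value so far)
    peak = walk[0]     # highest value since that ladder point
    for value in walk[1:]:
        if value < ladder:
            # A new ladder point closes the current excursion.
            heights.append(peak - ladder)
            ladder = value
            peak = value
        elif value > peak:
            peak = value
    heights.append(peak - ladder)  # excursion after the last ladder point
    return heights


seq1 = "ggagactgtagacagctaatgctata"
seq2 = "gaacgccctagccacgagcccttatc"
walk = [0]
for x, y in zip(seq1, seq2):
    walk.append(walk[-1] + (1 if x == y else -1))

heights = excursion_heights(walk)
print(heights)       # excursion heights for the example alignment
print(max(heights))  # the largest excursion height
```

For the example alignment this reproduces the excursion heights 1, 1, 4, 0, 0, 0, 3 listed on the slide.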

Example (cont.): General RW. Consider an arbitrary scoring scheme (e.g., a substitution matrix).

Primary Study of RW: 1-d simple RW. Consider a 1-d simple RW starting at h, restricted to the interval [a, b], where a and b are absorbing barriers, and each step is +1 with probability $p$ or -1 with probability $q = 1 - p$. Problems: I. (Absorption probabilities) What is the probability that the walk eventually finishes at b rather than a, i.e., $w_h$ (or at a rather than b, i.e., $u_h$)? II. What is the mean number of steps taken until the walk stops ($m_h$)?

Methods. (M1) The difference equation approach: the classical method. (M2) The moment-generating function approach: readily generalizes to more complicated walks.

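Difference Equation Approach (M1). Let $w_h$ be the probability that the simple random walk starting at h eventually finishes (is absorbed) at b, with step +1 having probability $p$ and step -1 having probability $q = 1 - p$. The difference equation is obtained by comparing the situation just before and just after the first step of the walk: $w_h = p\,w_{h+1} + q\,w_{h-1}$ (7.4). Boundary conditions: $w_a = 0$, $w_b = 1$ (7.5).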

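M1 (cont.): solutions. Solve Eq. 7.4 using the theory of homogeneous difference equations. When $p \ne q$, with $\theta^* = \log(q/p)$: $w_h = \frac{e^{\theta^* h} - e^{\theta^* a}}{e^{\theta^* b} - e^{\theta^* a}}$. The same procedure can be used to obtain the probability that the walk ends at a: $u_h = \frac{e^{\theta^* b} - e^{\theta^* h}}{e^{\theta^* b} - e^{\theta^* a}} = 1 - w_h$.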
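As a check on the absorption probability, the sketch below evaluates the standard closed form for the simple walk, $w_h = (e^{\theta^* h} - e^{\theta^* a}) / (e^{\theta^* b} - e^{\theta^* a})$ with $\theta^* = \log(q/p)$, and verifies numerically that it satisfies the first-step difference equation and the boundary conditions. The function name `absorb_at_b`, the example barriers, and the symmetric-case formula $(h - a)/(b - a)$ are assumptions added for the illustration.

```python
import math


def absorb_at_b(h, a, b, p):
    """Probability that a simple walk started at h is absorbed at b before a.

    Steps are +1 with probability p and -1 with probability q = 1 - p.
    """
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        # Symmetric walk: the closed form degenerates to a linear function.
        return (h - a) / (b - a)
    theta = math.log(q / p)
    num = math.exp(theta * h) - math.exp(theta * a)
    den = math.exp(theta * b) - math.exp(theta * a)
    return num / den


a, b, p = -3, 4, 0.4
q = 1.0 - p
# Largest violation of w_h = p*w_{h+1} + q*w_{h-1} over interior points.
residual = max(
    abs(absorb_at_b(h, a, b, p)
        - (p * absorb_at_b(h + 1, a, b, p) + q * absorb_at_b(h - 1, a, b, p)))
    for h in range(a + 1, b)
)
print(absorb_at_b(0, a, b, p), residual)
```

The residual should be zero up to rounding error, which confirms that the closed form solves the difference equation for these parameter values.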

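M1 (cont.): mean number of steps. Let $m_h$ be the mean number of steps until the walk starting at h stops, and let $w_h$ and $u_h = 1 - w_h$ be the probabilities of absorption at b and at a. Difference equation: $m_h = p\,m_{h+1} + q\,m_{h-1} + 1$. Boundary conditions: $m_a = m_b = 0$. Solution (for $p \ne q$): $m_h = \frac{w_h (b - h) + u_h (a - h)}{p - q}$.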
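The mean number of steps can be checked the same way. The sketch assumes the standard closed form $m_h = (w_h (b - h) + u_h (a - h)) / (p - q)$ for $p \ne q$, and $(h - a)(b - h)$ for the symmetric walk, and verifies that it satisfies $m_h = p\,m_{h+1} + q\,m_{h-1} + 1$ with $m_a = m_b = 0$. The function names and example barriers are hypothetical.

```python
import math


def absorb_at_b(h, a, b, p):
    """Probability of absorption at b for a simple walk started at h."""
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        return (h - a) / (b - a)
    theta = math.log(q / p)
    num = math.exp(theta * h) - math.exp(theta * a)
    den = math.exp(theta * b) - math.exp(theta * a)
    return num / den


def mean_steps(h, a, b, p):
    """Mean number of steps until a simple walk started at h hits a or b."""
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        return float((h - a) * (b - h))
    w = absorb_at_b(h, a, b, p)
    u = 1.0 - w
    return (w * (b - h) + u * (a - h)) / (p - q)


def max_residual(a, b, p):
    """Largest violation of m_h = p*m_{h+1} + q*m_{h-1} + 1 over interior h."""
    q = 1.0 - p
    return max(
        abs(mean_steps(h, a, b, p)
            - (p * mean_steps(h + 1, a, b, p)
               + q * mean_steps(h - 1, a, b, p) + 1.0))
        for h in range(a + 1, b)
    )


print(mean_steps(0, -3, 4, 0.3), max_residual(-3, 4, 0.3))
```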

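Moment-Generating Function Approach (M2). Recall the definition of the mgf of a random variable Y: $m(\theta) = E(e^{\theta Y})$. In our case, the step size S is +1 with probability $p$ and -1 with probability $q = 1 - p$, so its mgf is $m(\theta) = q e^{-\theta} + p e^{\theta}$. According to Theorem 1.1, there exists a unique nonzero value $\theta^*$ of $\theta$ such that $q e^{-\theta^*} + p e^{\theta^*} = 1$ (7.12), namely $\theta^* = \log(q/p)$.

M2 (cont.) From (2.17), the mgf of the total displacement after N steps is $(q e^{-\theta} + p e^{\theta})^N$, which equals 1 at $\theta = \theta^*$. When the walk has just finished, the total displacement is either $b - h$ or $a - h$, with probabilities $w_h$ and $u_h$ respectively, so that $w_h e^{\theta^* (b - h)} + u_h e^{\theta^* (a - h)} = 1$.

M2 (cont.) Therefore, since $u_h = 1 - w_h$, we have $w_h e^{\theta^* (b - h)} + (1 - w_h) e^{\theta^* (a - h)} = 1$. Thus $w_h = \frac{e^{\theta^* h} - e^{\theta^* a}}{e^{\theta^* b} - e^{\theta^* a}}$, which is identical to (7.9), the solution from the difference equation approach.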

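M2 (cont.): Mean number of steps until the walk stops. Let $T_N$ be the total displacement after N steps and $m(\theta)$ the mgf of the step size S. Theorem 7.1 (Wald's Identity) states: $E\left(m(\theta)^{-N} e^{\theta T_N}\right) = 1$. Differentiating both sides with respect to $\theta$ gives $E\left(-N\,m(\theta)^{-N-1} m'(\theta)\,e^{\theta T_N} + T_N\,m(\theta)^{-N} e^{\theta T_N}\right) = 0$.

M2 (cont.) At $\theta = 0$, where $m(0) = 1$ and $m'(0) = E(S)$, this becomes $E(T_N) = E(S)\,E(N)$ (7.24), where $E(T_N)$ is the mean displacement in N steps and $E(S)$ is the mean step size. This states that the mean value of the final total displacement of the walk is the mean size of each step multiplied by the mean number of steps taken until the walk stops.

M2 (cont.) For the simple walk, $E(S) = p - q$ and $E(T_N) = w_h (b - h) + u_h (a - h)$, so the mean number of steps until the walk stops is $m_h = E(N) = \frac{w_h (b - h) + u_h (a - h)}{p - q}$, which agrees with the result from the difference equation approach.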

An asymptotic case: the walk BLAST is concerned with. The walks of interest in BLAST have no upper boundary and end at -1. Applying the previous results with $h = 0$, $a = -1$, $p < q$, and letting $b \to \infty$, we get the following asymptotic results. The maximum value Y that the walk ever achieves before reaching -1 has a geometric-like distribution: $P(Y \ge y) \sim C e^{-\theta^* y}$, with $\theta^* = \log(q/p)$ and $C = 1 - e^{-\theta^*}$. The mean number of steps until the walk stops is $A = \frac{1}{q - p}$.
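The asymptotic claims can be checked numerically for the simple walk. The sketch assumes a walk started at 0 with lower barrier -1 and downward drift ($p < q$). It compares the exact probability of reaching height y before -1 with the geometric-like approximation $(1 - e^{-\theta^*}) e^{-\theta^* y}$, and compares the mean number of steps with a distant upper barrier against $1/(q - p)$. The function names and the parameter values are illustrative assumptions.

```python
import math


def reach_before_minus_one(y, p):
    """Exact probability that a simple walk from 0 reaches y before -1."""
    q = 1.0 - p
    theta = math.log(q / p)
    return (1.0 - math.exp(-theta)) / (math.exp(theta * y) - math.exp(-theta))


def geometric_like(y, p):
    """Approximation C * exp(-theta* y) with C = 1 - exp(-theta*)."""
    q = 1.0 - p
    theta = math.log(q / p)
    return (1.0 - math.exp(-theta)) * math.exp(-theta * y)


def mean_steps_from_zero(b, p):
    """Mean steps from 0 with barriers at -1 and b, for p != q."""
    q = 1.0 - p
    w = reach_before_minus_one(b, p)
    return (w * b + (1.0 - w) * (-1.0)) / (p - q)


p = 0.3
ratio = reach_before_minus_one(20, p) / geometric_like(20, p)
steps = mean_steps_from_zero(50, p)
print(ratio, steps, 1.0 / ((1.0 - p) - p))
```

The ratio approaches 1 as y grows, and the mean number of steps approaches $1/(q - p)$ as the upper barrier moves away.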

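General Walk. Suppose that, in general, the possible step sizes are $-c, -c+1, \ldots, 0, \ldots, d-1, d$, with respective probabilities $p_{-c}, p_{-c+1}, \ldots, p_d$. The mean step size is negative, i.e., $E(S) = \sum_{j=-c}^{d} j\,p_j < 0$. The mgf of S (the step size) is $m(\theta) = \sum_{j=-c}^{d} p_j e^{j\theta}$.

General Walk (cont.) According to Theorem 1.1, there exists a unique positive $\theta^*$ such that $\sum_{j=-c}^{d} p_j e^{j\theta^*} = 1$. To consider the walk that starts at 0, with stopping boundary at -1 and without upper boundary, impose an artificial barrier at $y > 0$. The possible stopping points are then $-c, -c+1, \ldots, -1, y, y+1, \ldots, y+d-1$. Wald's Identity states $E\left(e^{\theta^* T_N}\right) = 1$, where $T_N$ is the total displacement when the walk stops.

General Walk (cont.) Thus $\sum_k P_k e^{k\theta^*} = 1$, where $P_k$ is the probability that the walk finishes at the point k and the sum runs over the possible stopping points. The mean number of steps until the walk stops is $E(N) = \frac{\sum_k k\,P_k}{\sum_{j=-c}^{d} j\,p_j}$.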
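Both consequences of Wald's Identity for the general walk can be verified on a small example. The sketch below uses a hypothetical step distribution `STEPS` with negative mean and a hypothetical artificial barrier at y = 5. It finds the positive root $\theta^*$ of $m(\theta) = 1$ by bisection, computes the stopping-point probabilities $P_k$ and the mean number of steps by propagating the walk's distribution step by step, and then compares $\sum_k P_k e^{k\theta^*}$ with 1 and $E(N)\,E(S)$ with $\sum_k k P_k$. None of these names or numbers come from the slides.

```python
import math

# Hypothetical step distribution {step size: probability}, mean -0.5.
STEPS = {-2: 0.3, -1: 0.3, 0: 0.1, 1: 0.2, 2: 0.1}


def mgf(theta, steps):
    """m(theta) = sum_j p_j * exp(j * theta)."""
    return sum(p * math.exp(j * theta) for j, p in steps.items())


def theta_star(steps, tol=1e-13):
    """Unique positive root of m(theta) = 1, located by bisection."""
    hi = 1.0
    while mgf(hi, steps) <= 1.0:   # grow until m(hi) > 1
        hi *= 2.0
    lo = hi
    while mgf(lo, steps) >= 1.0:   # shrink until m(lo) < 1
        lo /= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mgf(mid, steps) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def stopping_distribution(steps, y, tol=1e-14):
    """Walk from 0, stopping at -1 or lower, or at y or higher.

    Returns ({stopping point k: P_k}, mean number of steps).
    """
    alive = {0: 1.0}   # probability mass at points not yet stopped
    stopped = {}
    mean_n = 0.0
    n = 0
    while sum(alive.values()) > tol:
        n += 1
        nxt = {}
        for pos, mass in alive.items():
            for j, p in steps.items():
                k = pos + j
                if k <= -1 or k >= y:
                    stopped[k] = stopped.get(k, 0.0) + mass * p
                    mean_n += n * mass * p
                else:
                    nxt[k] = nxt.get(k, 0.0) + mass * p
        alive = nxt
    return stopped, mean_n


ts = theta_star(STEPS)
P, mean_n = stopping_distribution(STEPS, y=5)
mean_step = sum(j * p for j, p in STEPS.items())
wald_sum = sum(prob * math.exp(k * ts) for k, prob in P.items())
mean_displacement = sum(k * prob for k, prob in P.items())
print(ts, sorted(P), wald_sum, mean_n, mean_displacement / mean_step)
```

Propagating the distribution avoids random simulation, so both identities can be compared up to rounding error rather than up to sampling noise.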

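General Walk: unrestricted. Objective: find the probability distribution of the maximum value that the walk ever achieves before reaching -1 or lower. Define: $F^*_Y(y)$, the probability that in the unrestricted walk the maximum upward excursion is y or less; $Q_k$, the probability that the walk visits the positive value k before reaching any other positive value ($k = 1, \ldots, d$, where d is the largest possible step).

General Walk: unrestricted (cont.) Therefore, $F^*_Y(y) = \bar{Q} + \sum_{k=1}^{y} Q_k F^*_Y(y - k)$, where $\bar{Q} = 1 - \sum_{k=1}^{d} Q_k$. That is, the event that in the unrestricted walk the maximum upward excursion is y or less is the union of the event that the walk never reaches positive values and the events that the first positive value achieved is k, $k = 1, 2, \ldots, y$, and the walk then never achieves a further height exceeding $y - k$. Applying the Renewal Theorem, we have $1 - F^*_Y(y - 1) \sim V e^{-\theta^* y}$, where $V = \frac{\bar{Q}}{(1 - e^{-\theta^*}) \sum_{k=1}^{d} k\,Q_k e^{k\theta^*}}$.

General Walk: restricted. Consider the general walk starting at 0, with lower barrier at -1. The size of an excursion of the unrestricted walk can exceed the value y either before or after reaching a negative value, i.e., $1 - F^*_Y(y) = 1 - F_Y(y) + \sum_{j=1}^{c} R_{-j} \left(1 - F^*_Y(y + j)\right)$, where $1 - F_Y(y)$ is the probability that the size of an excursion in the restricted walk exceeds the value y, and $R_{-j}$ is the probability that the first negative value reached by the walk is $-j$ ($j = 1, \ldots, c$, where $-c$ is the most negative possible step).

General Walk: restricted (cont.) Then $P(Y \ge y) = 1 - F_Y(y - 1) \sim C e^{-\theta^* y}$, where $C = V \left(1 - \sum_{j=1}^{c} R_{-j} e^{-j\theta^*}\right) = \frac{\bar{Q} \left(1 - \sum_{j=1}^{c} R_{-j} e^{-j\theta^*}\right)}{(1 - e^{-\theta^*}) \sum_{k=1}^{d} k\,Q_k e^{k\theta^*}}$.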

Application: BLAST. BLAST is the most frequently used method for assessing which DNA or protein sequences in a large database have significant similarity to a given query sequence. It is a procedure that searches for high-scoring local alignments between sequences and then tests the significance of the scores found via a P-value. The null hypothesis to be tested is that, for each aligned pair of amino acids, the two amino acids were generated by independent mechanisms.

BLAST (cont.): modeling. The positions in the alignment are numbered from left to right as 1, 2, …, N. A score S(j, k) is allocated to each position where the aligned amino acid pair (j, k) is observed, where S(j, k) is the (j, k) element of the chosen substitution matrix. The accumulated score at position i is the sum of the scores for the amino acid comparisons at positions 1, 2, …, i. As i increases, the accumulated score undergoes a random walk.

BLAST (cont.): calculating parameters. Let Y1, Y2, … be the respective maximum heights of the excursions of this walk after leaving one ladder point and before arriving at the next, and let Ymax be the maximum of these maxima. Ymax is in effect the test statistic used in BLAST, so it is necessary to find its distribution under the null hypothesis. The asymptotic probability distribution of any Yi is the geometric-like distribution. The values of C and $\theta^*$ in this distribution depend on the substitution matrix used and on the amino acid frequencies $\{p_j\}$ and $\{p'_j\}$. The probability distribution of Ymax also depends on n, the mean number of ladder points in the walk.

Discussion ???