Lecture 6 CS566
Pairwise Sequence Analysis-V
Relatedness
– “Not just important, but everything”
Modeling Alignment Scores
– Coin Tosses
– Unit Distributions
– Extreme Value Distribution
– Lambda and K revealed
– Loose Ends

Modeling Expectation
Reduced model: Coin tosses
– Given:
    N coin tosses
    Probability of heads p
– Problem: What is the expected length of the longest run of heads?
– Solution:
    Experimental: Perform several repetitions and record the longest run in each
    Theoretical: $E(\mathrm{Run}_{max}) = \log_{1/p} N$
– For example, for a fair coin and 64 tosses, $E(\mathrm{Run}_{max}) = \log_2 64 = 6$
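
The experimental and theoretical solutions can be compared directly. The following is a minimal sketch, not code from the lecture: the function names, the seed, and the number of repetitions are illustrative choices. Because $\log_{1/p} N$ is a leading-order estimate, the simulated average need not match it exactly.

```python
import math
import random


def longest_run(tosses):
    """Length of the longest run of consecutive True values (heads)."""
    best = current = 0
    for is_head in tosses:
        current = current + 1 if is_head else 0
        best = max(best, current)
    return best


def simulate_mean_longest_run(n_tosses, p_heads, repetitions, seed=0):
    """Average longest run of heads over many simulated toss sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(repetitions):
        tosses = (rng.random() < p_heads for _ in range(n_tosses))
        total += longest_run(tosses)
    return total / repetitions


def expected_longest_run(n_tosses, p_heads):
    """Leading-order estimate from the slide: log base 1/p of N."""
    return math.log(n_tosses) / math.log(1.0 / p_heads)


# Fair coin, 64 tosses: the example used on the slide.
theory = expected_longest_run(64, 0.5)
simulated = simulate_mean_longest_run(64, 0.5, repetitions=2000)
print(f"theory: {theory:.2f}, simulated: {simulated:.2f}")
```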

Random alignment as Coin tosses
Head = Match
Assume
– Score = Run of matches
– Maximum score = Longest run of matches
Therefore
– Same model of expectation
– For example: For DNA sequences of length N, $E(\mathrm{matchlength}_{max})$ = Expected longest run of matches = $\log_{1/p} N$
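
To use this model for DNA, a value for the match probability p is needed. The sketch below assumes (this is not stated on the slide) that both sequences are drawn independently from the same base composition, in which case p is the sum of squared base frequencies. The function names and the uniform composition are illustrative.

```python
import math


def match_probability(base_freqs):
    """Chance that two independently drawn bases are identical."""
    return sum(f * f for f in base_freqs.values())


def expected_longest_match_run(length, p_match):
    """Coin-toss estimate: log base 1/p of the sequence length."""
    return math.log(length) / math.log(1.0 / p_match)


# Assumed uniform composition, so p = 4 * 0.25**2 = 0.25.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = match_probability(uniform)
print(p, expected_longest_match_run(1024, p))
```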

Local alignment as Coin tosses
Assume
– Score in local alignment = Run of matches
– Maximum score = Longest run of matches
Therefore
– Similar model of expectation
– For DNA sequences of length n & m
    $E(\mathrm{Matchlength}_{max}) \sim \log_{1/p}(nm)$ (Why not just n or m?)
    $\sim \log_{1/p}(K' nm)$
    $\mathrm{Var}(\mathrm{Matchlength}_{max}) = C$ (i.e., a constant, independent of the size of the sample space)
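
The n·m search space comes from the fact that a local match may start at any position in either sequence. A small simulation can illustrate this under the slide's assumptions (score = run of matches, no gaps). This is a sketch only: the sequence lengths, trial count, and uniform base composition are assumptions chosen to keep it fast.

```python
import math
import random


def longest_common_run(a, b):
    """Longest run of consecutive matches over all diagonals (ungapped)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run ending at the previous diagonal cell.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best:
                    best = curr[j]
        prev = curr
    return best


def simulate_mean_local_run(n, m, trials, seed=0):
    """Average longest matching run between pairs of random DNA sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.choice("ACGT") for _ in range(n)]
        b = [rng.choice("ACGT") for _ in range(m)]
        total += longest_common_run(a, b)
    return total / trials


n, m, p = 200, 200, 0.25
theory = math.log(n * m) / math.log(1.0 / p)
simulated = simulate_mean_local_run(n, m, trials=20)
print(f"log_(1/p)(nm) = {theory:.2f}, simulated mean = {simulated:.2f}")
```

Any residual offset between the two numbers is the kind of correction the constant K' is meant to absorb.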

Refining Model
S = AS matrix based scoring between unrelated sequences
$E(S) \sim \log_{1/p}(K'nm) \sim [\ln(Knm)]/\lambda$ (where $\lambda = \log_e(1/p)$)
Holy Grail: Need $P(S > x)$, the probability of a score between unrelated sequences exceeding x

Poisson distribution estimate of P(S > x)
Consider Coin Toss Example
Given [$x \gg E(\mathrm{Run}_{max})$]
Define Success = ($\mathrm{Run}_{max} \ge x$)
Define $P_n$ = Probability of n successes
Define y = E[Success], i.e., Average no. of successes
Then, the number of successes follows a Poisson distribution: $P_n = (e^{-y} y^n)/n!$
Probability of 0 successes (No score exceeding x) is given by $P_0 = e^{-y}$
Then, probability of at least one score exceeding x: $P(S > x) = \sum_{i \ne 0} P_i = (1 - P_0) = 1 - e^{-y}$
For this Poisson distribution, $y = Kmn e^{-\lambda x}$. Therefore, $P(S > x) = 1 - \exp(-Kmn e^{-\lambda x})$
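
These formulas translate directly into code. The sketch below is illustrative rather than a reference implementation: the function names are hypothetical, and the values of K, lambda, m, and n are placeholders that do not come from the lecture. `math.expm1` is used so that 1 - exp(-y) stays accurate when y is very small.

```python
import math


def expected_hits(score, m, n, K, lam):
    """Poisson mean y = K*m*n*exp(-lambda*x): average number of successes."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """P(S > x) = 1 - exp(-y), computed with expm1 for accuracy at small y."""
    return -math.expm1(-expected_hits(score, m, n, K, lam))


def prob_n_hits(count, score, m, n, K, lam):
    """Poisson probability P_n of exactly `count` successes."""
    y = expected_hits(score, m, n, K, lam)
    return math.exp(-y) * y**count / math.factorial(count)


# Placeholder parameters for illustration only; not taken from the lecture.
K, lam, m, n = 0.1, 0.3, 300, 1_000_000
for x in (40, 60, 80):
    print(x, expected_hits(x, m, n, K, lam), p_value(x, m, n, K, lam))
```

When y is much smaller than 1, P(S > x) is approximately y itself, so the expected number of chance hits and the probability of at least one become nearly interchangeable.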

Unit Distributions
Normalize Gaussian and EVD
– Area under curve = 1
– Curve maximum at 0
Then
– For Gaussian
    Mean = 0; SD = 1
– For EVD
    Mean = $\gamma$ (Euler cons); Variance = $\pi^2/6$
    $P(S > x) = 1 - \exp(-e^{-x})$
    In general, with scale $\lambda$ and mode $u$: $P(S > x) = 1 - \exp(-e^{-\lambda(x-u)})$
– Z-score representation in terms of SDs
    $P(Z > z) = 1 - \exp(-e^{-\pi z/\sqrt{6} - \gamma})$
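
The Poisson estimate of P(S > x) and the general EVD tail are the same expression once u is identified. A short derivation, assuming u denotes the mode (location) of the score distribution and lambda its scale, as in the Poisson estimate:

```latex
\begin{align*}
P(S > x) &= 1 - \exp\!\left(-Kmn\,e^{-\lambda x}\right) \\
         &= 1 - \exp\!\left(-e^{\ln(Kmn)}\,e^{-\lambda x}\right) \\
         &= 1 - \exp\!\left(-e^{-\lambda\,(x - u)}\right),
\qquad u = \frac{\ln(Kmn)}{\lambda}.
\end{align*}
```

This u agrees with the estimate $E(S) \sim [\ln(Knm)]/\lambda$ up to the constant offset $\gamma/\lambda$ between the mode and the mean of an EVD.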

Lambda and K
$\lambda$ = Scale factor for scoring system
– Effectively converts AS matrix values to actual natural log likelihoods
K = Scale factor that reduces search space to compensate for non-independence of local alignments
Estimated by fitting to Poisson approximation or equation for E(S)
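
One simple way to estimate both parameters from scores of unrelated sequence pairs is a method-of-moments fit. This is a swapped-in technique chosen for brevity, not necessarily the fitting procedure the lecture has in mind. It relies on the EVD mean being u + gamma/lambda, the variance being pi^2/(6 lambda^2), and u = ln(Kmn)/lambda. All names and parameter values below are hypothetical, and the scores are synthetic.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def sample_evd(u, lam, size, seed=0):
    """Draw synthetic scores from an EVD with mode u and scale lambda."""
    rng = random.Random(seed)
    scores = []
    for _ in range(size):
        r = max(rng.random(), 1e-300)  # guard against log(0)
        # Inverse of the CDF exp(-exp(-lambda * (x - u))).
        scores.append(u - math.log(-math.log(r)) / lam)
    return scores


def fit_lambda_k(scores, m, n):
    """Method-of-moments estimates of lambda and K."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))  # Var = pi^2 / (6 lambda^2)
    u = mean - EULER_GAMMA / lam           # mean = u + gamma / lambda
    k = math.exp(lam * u) / (m * n)        # u = ln(K m n) / lambda
    return lam, k


# Synthetic check with assumed true values (illustrative only).
true_lam, true_k, m, n = 0.3, 0.1, 1000, 1000
true_u = math.log(true_k * m * n) / true_lam
lam_hat, k_hat = fit_lambda_k(sample_evd(true_u, true_lam, 20000), m, n)
print(f"lambda ~ {lam_hat:.3f}, K ~ {k_hat:.3f}")
```

Because K depends exponentially on the fitted lambda and u, its estimate is noticeably less stable than that of lambda.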

Treasure Trove of Probabilities
Probability distribution of scores between unrelated sequences, $P(S_{unrel})$
Probability distribution of the number of scores from $P(S_{unrel})$ exceeding some cut-off; its mean is the number of scores exceeding the cut-off observed on average
Probability of observing a score $\ge x$ between unrelated sequences, $P(S \ge x)$

Loose Ends
What about gap parameters?
– Short answer: No formal theory
– Long answer: Found empirically
    Choice of parameters can be used to convert local alignment algorithm into a global alignment
What about gapped alignment?
– Not formally proven, but simulations show statistical behavior similar to ungapped alignment
Effective sequence length $n' = n - E(\mathrm{matchLength}_{max})$
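
The effective-length correction can be sketched numerically. The lecture does not say which estimate of E(matchLength_max) to plug in; the example below assumes the coin-toss estimate log_{1/p}(nm), and the function name is hypothetical.

```python
import math


def effective_length(n, m, p_match):
    """n' = n - E(matchLength_max), taking the expectation as log_{1/p}(n*m)."""
    expected_run = math.log(n * m) / math.log(1.0 / p_match)
    return n - expected_run


# Assumed inputs: two DNA sequences of length 1000, match probability 0.25.
print(effective_length(1000, 1000, 0.25))
```

The intuition is that an alignment starting too close to the end of a sequence cannot reach the expected maximal length, so those start positions are excluded from the search space.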