1
Sequence Alignment - III Chitta Baral
2
Scoring Model
When comparing sequences
–We look for evidence that they have diverged from a common ancestor by a process of mutation and selection
Basic mutational processes
–Substitutions
–Insertions and deletions (together referred to as gaps)
Total score: the sum of a term for each aligned pair plus a term for each gap
–Corresponds to the logarithm of the relative likelihood that the sequences are related, compared to being unrelated
–Identities and conservative substitutions are expected to be more likely in real alignments than by chance: they contribute positive score terms
–Non-conservative changes are observed less frequently in real alignments than expected by chance: they contribute negative score terms
–Additive scoring scheme: based on the assumption that mutations at different sites in a sequence occur independently
  Reasonable for DNA and protein sequences
  Inaccurate for structural RNAs
3
Substitution Matrices
Notation: a pair of sequences x[1..n] and y[1..m]
–Let $x_i$ be the ith symbol in x
–And $y_j$ be the jth symbol in y
–Let $p_{x_i y_i}$ be the probability that $x_i$ and $y_i$ are related (in an ungapped alignment, $x_i$ is paired with $y_i$)
–Let $q_{x_i}$ be the probability that we have $x_i$ by chance, i.e. the frequency of occurrence of $x_i$
Score: $\log \dfrac{P(x, y \mid \text{related})}{P(x, y \mid \text{unrelated})}$
–$P(x, y \mid \text{related}) = p_{x_1 y_1}\, p_{x_2 y_2} \cdots$
–$P(x, y \mid \text{unrelated}) = q_{x_1} q_{x_2} \cdots \times q_{y_1} q_{y_2} \cdots$
Odds ratio: $\dfrac{p_{x_1 y_1}}{q_{x_1} q_{y_1}} \times \dfrac{p_{x_2 y_2}}{q_{x_2} q_{y_2}} \times \cdots$
Log-odds ratio: $s(x_1, y_1) + s(x_2, y_2) + \cdots$
–Where $s(a, b) = \log \dfrac{p_{ab}}{q_a q_b}$
–The $s(a, b)$ table is known as the score matrix or substitution matrix
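The log-odds construction above can be sketched in a few lines of Python. The two-letter alphabet and the probabilities `p` and `q` below are hypothetical values chosen only so that the numbers are easy to follow; they are not from the slides. The sketch shows that summing $s(a, b)$ over the aligned pairs gives the same number as taking the logarithm of the product of per-pair odds.

```python
import math

# Hypothetical toy model (an assumption for illustration, not from the slides).
# q[a]: background frequency of symbol a.
# p[(a, b)]: probability of seeing a aligned with b in related sequences.
q = {"A": 0.6, "B": 0.4}
p = {("A", "A"): 0.5, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.3}


def substitution_score(a, b):
    """s(a, b) = log(p_ab / (q_a * q_b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def log_odds_score(x, y):
    """Sum of s(x_i, y_i) over an ungapped alignment of equal-length strings."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    return sum(substitution_score(a, b) for a, b in zip(x, y))


def odds_ratio(x, y):
    """Product of the per-pair odds p_ab / (q_a * q_b)."""
    ratio = 1.0
    for a, b in zip(x, y):
        ratio *= p[(a, b)] / (q[a] * q[b])
    return ratio
```

With these toy numbers, identical pairs get a positive score and the mismatched pair a negative one, which is the behaviour the scoring model asks for.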
4
Gap Penalties
Also based on a probabilistic model of alignment
–Less widely recognized than the probabilistic basis of substitution matrices
Gap of length g due to insertion of $a_1 \dots a_g$
–$P(\text{gap because of mutation}) = f(g)\, q_{a_1} \cdots q_{a_g}$
–$P(\text{having } a_1 \dots a_g \text{ by chance}) = q_{a_1} \cdots q_{a_g}$
–Ratio $= f(g)$
–Log of ratio $= \log f(g)$
–Geometric distribution: $f(g) = k \exp(-xg)$
–Suppose $f(g) = \exp(-gd)$; then log of ratio $= -gd$ (linear score)
–Suppose $f(g) = k \exp(-ge)$, where e is the gap-extension parameter and not the base of the natural logarithm; then log of ratio $= -ge + \log k = -ge + e + (\log k - e) = -(e - \log k) - (g - 1)e = -d - (g - 1)e$, where $d = e - \log k$ (affine score)
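The affine derivation can be checked numerically. The sketch below assumes hypothetical values for k and e (they are not given on the slide) and compares $\log f(g)$ computed directly from $f(g) = k \exp(-ge)$ with the affine form $-d - (g - 1)e$, using $d = e - \log k$.

```python
import math


def linear_gap_score(g, d):
    """Linear gap score: -g * d for a gap of length g."""
    return -g * d


def affine_gap_score(g, d, e):
    """Affine gap score: -d for opening the gap, -e for each further position."""
    return -d - (g - 1) * e


def log_gap_probability(g, k, e):
    """log f(g) for the assumed distribution f(g) = k * exp(-g * e)."""
    return math.log(k * math.exp(-g * e))


# Hypothetical parameters, chosen only for illustration.
k = 0.05
e = 1.0
d = e - math.log(k)  # gap-open penalty implied by the derivation

for g in range(1, 6):
    # Both expressions should agree up to floating-point error.
    assert math.isclose(log_gap_probability(g, k, e), affine_gap_score(g, d, e))
```

Because k is below 1 here, $\log k$ is negative and d comes out larger than e, so opening a gap costs more than extending one.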
5
Repeated matches
A big string x[1..n] and a smaller string y[1..m]
Asymmetric: we look for multiple matches of y in x.
As we do the matching and fill the table, we need to decide when to stop going further in y and start over from the beginning of y.
$F(i, 0)$: assuming $x_i$ is in an unmatched region, the best total score so far.
$F(i, j)$, $j \ge 1$: assuming $x_i$ is in a matched region and the last match ends at $x_i$ and $y_j$, the best total score so far.
$F(0, 0) = 0$.
$F(i, j) = \max \{\, F(i, 0);\ F(i-1, j-1) + s(x_i, y_j);\ F(i-1, j) - d;\ F(i, j-1) - d \,\}$
–$F(i, 0)$ corresponds to the start-over option (but now we store the total score so far)
$F(i, 0) = \max \{\, F(i-1, 0);\ F(i-1, j) - T,\ j = 1, \dots, m \,\}$
–T is a threshold: we are only interested in matches scoring higher than T. (Important, because there are always short local alignments with small positive scores even between entirely unrelated sequences.)
6
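Illustration of repeated matches
The top row holds F(i,0); the first numeric column corresponds to i = 0.
         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   1   1   1   1   1   3   9   9
 P   0   0   0   0   1   1   1   1   1   3   9
 A   0   0   0   5   1   6   1   1   1   3   9
 W   0   0   0   0   2   1  21  13   5   3   9
 H   0  10   2   0   1   1  13  19  23  15   9
 E   0   2  16   8   1   1   5  11  19  29  21
 A   0   0   8  21  13   6   1   5  11  21  28
 E   0   0   6  13  18  12   4   1   5  17  27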
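The recurrences for repeated matches can be sketched in Python as follows. Several things here are assumptions, because the slides do not state them: the substitution scores are taken to be the BLOSUM50 entries for the residue pairs that occur in this example, the gap penalty is d = 8, the threshold is T = 20, and F(0, j) is initialised to 0 for every j. Under those assumptions the sketch yields 0 0 0 0 1 1 1 1 1 3 9 9 for F(i, 0) on the sequences HEAGAWGHEE and PAWHEAE, which agrees with the top row of the illustration.

```python
# Assumed BLOSUM50 entries for the pairs needed below (keys are sorted letter pairs).
# The slides do not say which substitution matrix was used.
PAIR_SCORES = {
    "AA": 5, "EE": 6, "HH": 10, "WW": 15,
    "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HP": -2, "HW": -3, "PW": -4,
}


def pair_score(a, b):
    """Symmetric lookup of s(a, b) in the assumed score table."""
    return PAIR_SCORES["".join(sorted((a, b)))]


def repeated_matches(x, y, score, d, T):
    """Fill F for repeated matches of y in x.

    Returns (F, final), where F[i][j] follows the slide's F(i, j) with
    i indexing x and j indexing y, and final is F(n+1, 0).
    """
    n, m = len(x), len(y)
    # Assumption: row i = 0 is all zeros, including F(0, j) for j >= 1.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # Unmatched state: carry on, or close a match that beat the threshold.
        F[i][0] = max(F[i - 1][0], max(F[i - 1][1:]) - T)
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                        # start over
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),    # aligned pair
                F[i - 1][j] - d,                                # gap in y
                F[i][j - 1] - d,                                # gap in x
            )
    # One extra application of the F(i, 0) rule gives the overall total.
    final = max(F[n][0], max(F[n][1:]) - T)
    return F, final


F, final = repeated_matches("HEAGAWGHEE", "PAWHEAE", pair_score, d=8, T=20)
top_row = [F[i][0] for i in range(len(F))] + [final]
print(top_row)
```

A traceback is not included; it would start from the final F(n+1, 0) value and follow the choices that produced each maximum.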
7
Next
–Alignment with affine gap scores
–Heuristic-based approaches