CS 5263 Bioinformatics Lecture 6: Sequence Alignment Statistics.


Roadmap
Last lecture review
– Affine gap penalty (more today)
– Local sequence alignment
– Statistics of substitution matrices
Statistics of alignment scores
Sequence alignment and FSA
– Affine gap penalty
– More complex models

Seq Alignment Algorithms
Global alignment
– Basic: Needleman-Wunsch
– Variants (LCS, overlapping, …)
– Bounded DP (pruning search space)
– Linear space (divide-and-conquer)
– Affine gap penalty
Local alignment
– Basic: Smith-Waterman
– All tricks in global alignment applicable: bounded DP, linear space, affine gap

The local alignment problem
Given two strings $X = x_1 \ldots x_M$ and $Y = y_1 \ldots y_N$, find substrings $x'$, $y'$ whose similarity (optimal global alignment value) is maximum.
e.g. X = abcxdex, X' = cxde
     Y = xxxcde,  Y' = c-de

The Smith-Waterman algorithm
Initialization: $F(0, j) = F(i, 0) = 0$
Iteration: $F(i, j) = \max\{\,0,\ F(i-1, j) - d,\ F(i, j-1) - d,\ F(i-1, j-1) + \sigma(x_i, y_j)\,\}$

The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment: $F_{OPT} = \max_{i,j} F(i, j)$
2. If we want all local alignments scoring > t: for all i, j, find F(i, j) > t, and trace back
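The recurrence and the "best local alignment" termination rule can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `smith_waterman` and the default scores (match 2, mismatch -1, gap penalty d = 1) are assumptions chosen for the example, and ties in the traceback are broken arbitrarily.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=1):
    """Return (best score, aligned x', aligned y') using a linear gap penalty d.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    m, n = len(x), len(y)
    # F[i][j] = best score of a local alignment ending at x[i-1] and y[j-1].
    # Row 0 and column 0 stay at 0 (initialization).
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + s)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the best-scoring cell until a cell with score 0.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = smith_waterman("abcxdex", "xxxcde")
print(score, ax, ay)
```

With these assumed scores the example strings X = abcxdex and Y = xxxcde reach a best score of 5 (three matches and one gap). More than one alignment achieves that score, so the substrings returned depend on how ties are broken.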

Analysis
Time:
– O(MN) for finding the best alignment
– Additional time for sub-optimal alignments, depending on how many are reported
Memory:
– O(MN)
– O(M+N) possible with the linear-space technique

The statistics of alignment
Where does $\sigma(x_i, y_j)$ come from?
Are two aligned sequences actually related?

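Protein substitution matrix
$\sigma(s, t)$: score to align amino acid s against t
$p_s$, $p_t$: frequencies of s and t in the database
$q_{st}$: the frequency with which s is aligned to t in real homologous sequences
$\sigma(s, t) = \frac{1}{\lambda} \ln \frac{q_{st}}{p_s p_t}$, where $\ln \frac{q_{st}}{p_s p_t}$ is the log odds ratio and $\lambda$ is the scaling factor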

BLOSUM matrices
$p_s$, $p_t$, $q_{st}$ estimated from trusted alignments in the BLOCKS database
Eliminate near-identical sequences
– BLOSUM-N: constructed from sequences where identity between any pair of sequences is less than N%
– BLOSUM-62: good for most purposes
Scale: BLOSUM-45 (weak homology), BLOSUM-62, BLOSUM-90 (strong homology)

DNA substitution matrix
Given the percent identity you would like to detect, plus assumptions about the base frequencies, you can calculate the substitution matrix from the log odds ratio.

Example
Assume $p_A = p_C = p_T = p_G = 0.25$
We want 88% identity
$q_{AA} = q_{CC} = q_{TT} = q_{GG} = 0.22$; the rest $= 0.12/12 = 0.01$
Resulting matrix:
     A    C    G    T
A    5   -7   -7   -7
C   -7    5   -7   -7
G   -7   -7    5   -7
T   -7   -7   -7    5
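The arithmetic behind this example can be checked with a short Python sketch of the log odds calculation. The function name `log_odds_scores` and the choice to pick the scaling factor so that a match scores exactly 5 are assumptions made for illustration; the slide does not say how the scale was chosen.

```python
import math


def log_odds_scores(q_match, q_mismatch, p=0.25, target_match=5.0):
    """Turn target frequencies into scaled log odds scores.

    Assumes uniform base frequency p and picks lambda so that the
    match score equals target_match (an assumption for this sketch).
    """
    raw_match = math.log(q_match / (p * p))        # ln(q_ss / (p_s * p_s))
    raw_mismatch = math.log(q_mismatch / (p * p))  # ln(q_st / (p_s * p_t))
    lam = raw_match / target_match                 # scaling factor
    return raw_match / lam, raw_mismatch / lam, lam


match, mismatch, lam = log_odds_scores(0.22, 0.01)
print(round(match), round(mismatch), round(lam, 3))
```

Rounding the scaled values to integers gives the 5 / -7 matrix of the example; the unrounded mismatch score is closer to -7.3.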

Arbitrary substitution matrix
Even an arbitrary substitution matrix has a meaning
Better know what you are doing
Solve the polynomial equation $\sum_{s,t} p_s p_t e^{\lambda \sigma(s,t)} = 1$ to obtain the scaling factor $\lambda$
Calculate the target frequencies $q_{st} = p_s p_t e^{\lambda \sigma(s,t)}$
Calculate the target percent identity

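Example
Matrix 1:
     A    C    G    T
A    1   -2   -2   -2
C   -2    1   -2   -2
G   -2   -2    1   -2
T   -2   -2   -2    1
$\lambda = 1.33$
$q_{st} = 0.24$ for s = t, and 0.004 for s ≠ t
Translate: 95% identity
Matrix 2:
     A    C    G    T
A    5   -4   -4   -4
C   -4    5   -4   -4
G   -4   -4    5   -4
T   -4   -4   -4    5
$\lambda \approx 0.19$ (that is, $e^{\lambda} \approx 1.21$)
$q_{st} = 0.16$ for s = t, and 0.03 for s ≠ t
Translate: 65% identity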
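The scaling factor for a match/mismatch DNA matrix can be found numerically. The sketch below uses bisection on $\sum_{s,t} p_s p_t e^{\lambda \sigma(s,t)} = 1$ under the assumption of uniform base frequencies (0.25 each). The names `solve_lambda` and `target_identity` and the search bracket are illustrative choices, not part of the lecture.

```python
import math


def solve_lambda(match, mismatch, p=0.25, upper=10.0):
    """Find the positive root of sum_{s,t} p_s p_t exp(lam * sigma(s,t)) = 1.

    Assumes four bases with equal frequency p and a simple
    match/mismatch matrix. Uses bisection on (0, upper].
    """
    if 4 * p * p * match + 12 * p * p * mismatch >= 0:
        raise ValueError("expected score must be negative")

    def f(lam):
        # 4 matching pairs and 12 mismatching pairs
        return (4 * p * p * math.exp(lam * match)
                + 12 * p * p * math.exp(lam * mismatch) - 1.0)

    lo, hi = 1e-6, upper      # lam = 0 is a trivial root, so start just above it
    for _ in range(100):      # 100 halvings is far more than double precision needs
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def target_identity(match, lam, p=0.25):
    """Target percent identity = sum of q_ss over the four bases."""
    return 4 * p * p * math.exp(lam * match)


for m, s in [(1, -2), (5, -4)]:
    lam = solve_lambda(m, s)
    print(m, s, round(lam, 3), round(target_identity(m, lam), 2))
```

For the two matrices in the example this gives scaling factors of roughly 1.33 and 0.19, with target identities of about 95% and 65%.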

Today Significance of alignment score Sequence alignment and FSA

Statistics of Alignment Scores Q: How do we assess whether an alignment provides good evidence for homology? –Is a score 82 good? What about 180? A: determine how likely it is that such an alignment score would result from chance

Most of the theory applies to local alignment For global alignment, your best bet is to do Monte-Carlo simulation –Randomly shuffle your sequences before alignment –What’s the chance you can get a score as high as the real alignment?

Procedure to estimate the significance of a global alignment
– Given sequences X and Y
– Global alignment score = S
– Randomly shuffle sequence X (or Y) N times to obtain $X_1, X_2, \ldots, X_N$
– Align each $X_i$ with Y and let the score be $S_i$
– Plot the distribution of the $S_i$ and see where the real S is located
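The shuffling procedure can be sketched in Python as below. This is a minimal illustration under stated assumptions: `global_score` is a plain Needleman-Wunsch score with a linear gap penalty and made-up scoring values (match 1, mismatch -1, gap 2), and `shuffle_p_value` is a hypothetical helper name. A real analysis of proteins would use a substitution matrix such as BLOSUM-62.

```python
import random


def global_score(x, y, match=1, mismatch=-1, d=2):
    """Needleman-Wunsch score with linear gap penalty d, keeping two rows."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-d * i]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def shuffle_p_value(x, y, n=200, seed=0):
    """Shuffle x n times and count shuffled scores >= the real score."""
    rng = random.Random(seed)      # fixed seed so the sketch is repeatable
    real = global_score(x, y)
    letters = list(x)
    hits = 0
    for _ in range(n):
        rng.shuffle(letters)
        if global_score("".join(letters), y) >= real:
            hits += 1
    return real, hits, hits / n


seq = "ACGTTGCAAGCTTAGCCGATATCGGATCCA"
print(shuffle_p_value(seq, seq, n=50))
```

When no shuffled sequence reaches the real score, the returned fraction is 0, which should be read only as "p-value below 1/n", not as an exact zero.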

Figure: alignment of mouse HEXA and human HEXA. Score = 732

Figure: distribution of the alignment scores between mouse HEXA and 200 randomly shuffled human HEXA sequences, with the real score of 732 indicated

Figure: alignment of human HEXA and fly HEXO1. Score = -74

Figure: distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences, with the real score of -74 indicated

P-value of alignment
p-value
– The probability that an alignment score at least this high can be obtained from aligning random sequences
– A small p-value means the score is unlikely to happen by chance
Informally, a p-value of 0.05 is read as being 95% sure that the result is significant. Strictly, it means that 5% of random alignments would score at least as high.

What p-value is significant? The most common thresholds are 0.01 and 0.05. Is 95% enough? It depends on the cost associated with making a mistake. Examples of costs: –Doing expensive wet lab validation. –Making clinical treatment decisions. –Misleading the scientific community. Most sequence analysis uses more stringent thresholds because the p-values are not very accurate.

Score = -74: there are 88 random sequences with alignment score >= -74. Therefore P-value = 88 / 200 = 0.44 => the alignment is not significant

Score = 732: there are no random sequences with alignment score >= 732. Therefore the P-value is less than 1 / 200 = 0.005 => significant. Even though the p-value looks much smaller than 0.05, we cannot say how much smaller unless we generate more random sequences

Drawbacks
Monte-Carlo may take a long time
Cannot accurately estimate the p-value if p is small
To get a p-value of $10^{-5}$, we have to align $10^5$ random sequences
– Unless we can fit a distribution
Such a distribution may not be generalizable
No theory exists for the global alignment score distribution

Statistics for local alignment Theory much more elegant Score for ungapped local alignment follows extreme value distribution (Gumbel distribution) This distribution is characterized by a larger tail on the right.

Figure: normal distribution (left) and extreme value distribution (right)
Intuitive interpretation of the extreme value distribution
Randomly sample 100 numbers from a normal distribution, and compute the max
Repeat 100 times. The max values will follow an extreme value distribution
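The sampling experiment described here is easy to reproduce. The sketch below draws 100 standard normal values, keeps the maximum, and repeats 100 times. The function name `sample_maxima` and the fixed seed are assumptions added so the run is repeatable.

```python
import random
import statistics


def sample_maxima(n_draws=100, n_repeats=100, seed=0):
    """Return n_repeats maxima, each taken over n_draws standard normal draws."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(n_draws))
            for _ in range(n_repeats)]


maxima = sample_maxima()
print(round(statistics.mean(maxima), 2), round(min(maxima), 2), round(max(maxima), 2))
```

A histogram of `maxima` should look skewed to the right, with most values falling roughly between 2 and 3, unlike the symmetric normal distribution the individual draws came from.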

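Computing a p-value
The probability of observing a score > 4 is the area under the curve to the right of 4.
For score S, under a standard extreme value distribution (location 0, scale 1), this probability is calculated as $P(X > S) = 1 - \exp(-e^{-S})$.
For S = 4, this gives $1 - \exp(-e^{-4}) \approx 0.018$.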
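A small Python helper makes the tail-area calculation concrete. It uses the general Gumbel form with a location `mu` and a scale parameter `lam`. These parameter names are assumptions for this sketch, and the defaults correspond to the standard distribution (location 0, scale 1).

```python
import math


def gumbel_tail(s, mu=0.0, lam=1.0):
    """P(X > s) for an extreme value (Gumbel) distribution.

    Computes 1 - exp(-exp(-lam * (s - mu))) using expm1 so that
    very small tail probabilities keep their precision.
    """
    return -math.expm1(-math.exp(-lam * (s - mu)))


print(round(gumbel_tail(4.0), 4))
```

For very large scores the inner exponential becomes tiny and the tail probability is almost equal to it, which is why the p-value can be approximated by the exponential term alone when it is small.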

Statistics for local alignment
How does this apply to sequence alignment?
Given two unrelated sequences of lengths M and N, the expected number of local alignments with score >= S can be calculated by
– $E(S) = K M N e^{-\lambda S}$
– Known as the E-value
– $\lambda$: scaling factor, as computed in the last lecture
– K: empirical parameter, ~0.1; depends on sequence composition and substitution matrix

P-value for alignment score
P-value for a local alignment score S: $P(S) = 1 - e^{-E(S)} \approx E(S)$ when P is small.

Example
You are aligning two sequences, each of 1000 bases
Match score m = 1, mismatch score s = -1, gap score d = -inf (ungapped alignment)
You obtain a score of 20
Is this score significant?

$\lambda = \ln 3 \approx 1.1$
$E(S) = K M N e^{-\lambda S}$
$E(20) = 0.1 \times 1000 \times 1000 \times 3^{-20} \approx 3 \times 10^{-5}$
P-value $\approx 3 \times 10^{-5} \ll 0.05$
The alignment is significant

Figure: distribution of alignment scores for 1000 random sequence pairs, with the observed score of 20 indicated

Multiple-testing problem
What if you are searching a 1000-base sequence against a database of $10^6$ sequences (average length 1000 bases)? How significant is a score of 20 now?
You are essentially comparing 1000 bases with $1000 \times 10^6 = 10^9$ bases (ignoring edge effects)
$E(20) = 0.1 \times 1000 \times 10^9 \times 3^{-20} \approx 30$
By chance we would expect to see about 30 matches
P-value $= 1 - e^{-30} \approx 0.9999999999$
Not significant at all
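Both the pairwise example and the database search can be reproduced with a few lines of Python. The function names `e_value` and `p_value` are illustrative, and K = 0.1 is the rough empirical value quoted earlier rather than a fitted parameter.

```python
import math


def e_value(score, m, n, lam, k=0.1):
    """Expected number of chance local alignments scoring >= score: K*M*N*exp(-lam*S)."""
    return k * m * n * math.exp(-lam * score)


def p_value(e):
    """P = 1 - exp(-E), which is close to E when E is small."""
    return -math.expm1(-e)


lam = math.log(3)                           # match = 1, mismatch = -1, uniform bases
pairwise = e_value(20, 1000, 1000, lam)     # two 1000-base sequences
database = e_value(20, 1000, 10**9, lam)    # 1000 bases against 10^9 database bases
print(pairwise, p_value(pairwise))
print(database, p_value(database))
```

The same score of 20 gives an E-value near 3 x 10^-5 in the pairwise case and near 30 in the database case. The only thing that changes is the size of the search space, which is the point of the multiple-testing slide.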

Statistics for gapped local alignment
Theory not well developed
Extreme value distribution works well empirically
Need to estimate K and $\lambda$ empirically
– Given the database and substitution matrix, generate some random sequence pairs
– Do local alignment
– Fit an extreme value distribution to obtain K and $\lambda$
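One simple way to do the fitting step is the method of moments, sketched below in Python. This is an assumption for illustration, since the slides do not say which fitting method is used and production tools typically use more careful estimators. It relies on the fact that $P(S \ge x) = 1 - \exp(-KMN e^{-\lambda x})$ is a Gumbel distribution with scale $1/\lambda$ and location $\mu = \ln(KMN)/\lambda$. The helper `simulate_gumbel_scores` is hypothetical and only stands in for real alignment scores of random sequence pairs.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(scores, m, n):
    """Method-of-moments estimates of (K, lambda) from local alignment scores.

    Uses the Gumbel identities sd = pi / (lambda * sqrt(6)) and
    mean = mu + gamma / lambda, then K = exp(lambda * mu) / (M * N).
    """
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))
    mu = mean - EULER_GAMMA / lam
    k = math.exp(lam * mu) / (m * n)
    return k, lam


def simulate_gumbel_scores(k, lam, m, n, count, seed=1):
    """Hypothetical stand-in for scores of random pairs: draws from the Gumbel itself."""
    rng = random.Random(seed)
    mu = math.log(k * m * n) / lam
    # If E is Exp(1), then mu - ln(E) / lam follows the Gumbel distribution above.
    return [mu - math.log(rng.expovariate(1.0)) / lam for _ in range(count)]


scores = simulate_gumbel_scores(0.1, math.log(3), 100, 100, 10000)
print(fit_gumbel(scores, 100, 100))
```

In practice `scores` would come from local alignments of randomly generated sequence pairs of lengths M and N. The estimate of K is sensitive to small errors in the scaling factor, so many pairs are needed.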
