Expected accuracy sequence alignment Usman Roshan.


1 Expected accuracy sequence alignment Usman Roshan

2 Expected accuracy alignment The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This optimal alignment is not necessarily the most “accurate” or biologically informative one. We now look at a different formulation of alignment that allows us to compute the most accurate alignment instead of the optimal-scoring one.

3 Posterior probability of x_i aligned to y_j Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*. We define the posterior probability of the i-th residue of x (x_i) aligning to the j-th residue of y (y_j) in the true alignment a* of x and y as the total probability of all alignments in A in which x_i is aligned to y_j: $P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$ (Do et al., Genome Research, 2005).

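4 Expected accuracy of alignment We can define the expected accuracy of an alignment a as the sum of the posterior probabilities of its aligned pairs, divided by the length of the shorter sequence: $E[\mathrm{accuracy}(a, a^*) \mid x, y] = \frac{1}{\min(|x|, |y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)$. The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm, using the posterior probabilities as match scores and no gap penalties: $A(i, j) = \max\{A(i-1, j-1) + P(x_i \sim y_j \in a^* \mid x, y),\ A(i-1, j),\ A(i, j-1)\}$ (Do et al., Genome Research, 2005).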
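The dynamic programming step mentioned on this slide can be sketched in Python. This is a minimal illustration rather than the PROBCONS implementation: the function name `mea_alignment`, the dictionary `post` keyed by 1-based `(i, j)` pairs, and the normalization by the shorter sequence length are assumptions made for this sketch.

```python
def mea_alignment(post, n, m):
    """Maximum expected accuracy alignment of sequences of length n and m.

    post maps 1-based (i, j) to an (assumed) posterior P(x_i ~ y_j);
    missing pairs count as 0. Returns (expected accuracy, aligned pairs).
    """
    # A[i][j] = best sum of posteriors for prefixes x_1..x_i and y_1..y_j.
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(
                A[i - 1][j - 1] + post.get((i, j), 0.0),  # align x_i with y_j
                A[i - 1][j],                              # x_i against a gap
                A[i][j - 1],                              # y_j against a gap
            )

    # Traceback: A[i][j] is exactly one of the three candidates above.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j]:
            i -= 1
        elif A[i][j] == A[i][j - 1]:
            j -= 1
        else:
            pairs.append((i, j))
            i -= 1
            j -= 1
    pairs.reverse()
    return A[n][m] / min(n, m), pairs


if __name__ == "__main__":
    # Illustrative posteriors only (hypothetical values for a 4 x 5 problem).
    post = {(1, 1): 1.0, (2, 2): 1.0, (3, 3): 0.1, (3, 4): 1.0, (4, 5): 1.0}
    print(mea_alignment(post, 4, 5))
```

No gap penalty appears in the recursion because gaps contribute nothing to the expected accuracy; only aligned residue pairs are rewarded.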

5 Example for expected accuracy
True alignment:
AC_CG
ACCCA
Expected accuracy = (1+1+0+1+1)/4 = 1
Estimated alignment:
ACC_G
ACCCA
Expected accuracy = (1+1+0.1+0+1)/4 = 3.1/4 ≈ 0.78
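The arithmetic in this example can be checked with a short Python sketch. The helper names `aligned_pairs` and `expected_accuracy` are hypothetical, and the posterior values are the illustrative ones implied by the slide (1 for each pair of the true alignment, 0.1 for the third C of each sequence), not estimates from real data.

```python
def aligned_pairs(ax, ay, gap="_"):
    """Return the 1-based residue pairs (i, j) aligned in a gapped alignment."""
    pairs = []
    i = j = 0
    for cx, cy in zip(ax, ay):
        if cx != gap:
            i += 1
        if cy != gap:
            j += 1
        if cx != gap and cy != gap:
            pairs.append((i, j))
    return pairs


def expected_accuracy(ax, ay, post, gap="_"):
    """Sum the posteriors of the aligned pairs, divided by the shorter length."""
    len_x = len(ax.replace(gap, ""))
    len_y = len(ay.replace(gap, ""))
    total = sum(post.get(pair, 0.0) for pair in aligned_pairs(ax, ay, gap))
    return total / min(len_x, len_y)


if __name__ == "__main__":
    # Illustrative posteriors taken from the example on the slide.
    post = {(1, 1): 1.0, (2, 2): 1.0, (3, 4): 1.0, (4, 5): 1.0, (3, 3): 0.1}
    print(expected_accuracy("AC_CG", "ACCCA", post))  # true alignment
    print(expected_accuracy("ACC_G", "ACCCA", post))  # estimated alignment
```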

6 Estimating posterior probabilities If correct posterior probabilities can be computed, then we can compute the correct alignment. It remains to estimate these probabilities from the data. PROBCONS (Do et al., Genome Research 2005): estimates probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998). Probalign (Roshan and Livesay, Bioinformatics 2006): estimates probabilities using partition function dynamic programming matrices.

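7 Posterior probabilities from HMM We need to sum the probabilities of all alignments where x_i is aligned to y_j. In other words we want: $P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$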

8 Forward and backward probabilities Define the forward probability f_k(i) as the probability of emitting x_1 x_2 … x_i and being in hidden state k at position i. Similarly, define the backward probability b_k(i) as the probability of emitting x_{i+1} x_{i+2} … x_n given that the i-th hidden state is k. Both f_k(i) and b_k(i) can be computed quickly by dynamic programming (see HMM lecture notes pages 9 to 11).

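9 Once the forward and backward probabilities are computed, we can calculate the posterior probability that the i-th hidden state $\pi_i$ is k: $P(\pi_i = k \mid x) = f_k(i)\, b_k(i) / P(x)$, where $P(x)$ is the total probability of the sequence.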
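As an illustration of the forward and backward recursions, here is a minimal Python sketch for an ordinary (single-sequence) HMM. It is not the pair-HMM used by PROBCONS; the two-state model, its parameters, and the function name `forward_backward` are all hypothetical and chosen only to keep the example small. The sketch assumes no explicit end state.

```python
def forward_backward(obs, states, start, trans, emit):
    """Return (posteriors, P(x)) for an HMM without an explicit end state.

    posteriors[i][k] is the probability that position i (0-based) is in state k.
    """
    n = len(obs)
    f = [{k: 0.0 for k in states} for _ in range(n)]
    b = [{k: 0.0 for k in states} for _ in range(n)]

    # Forward: f[i][k] = P(x_1..x_i, state at i is k).
    for k in states:
        f[0][k] = start[k] * emit[k][obs[0]]
    for i in range(1, n):
        for k in states:
            f[i][k] = emit[k][obs[i]] * sum(f[i - 1][l] * trans[l][k] for l in states)

    # Backward: b[i][k] = P(x_{i+1}..x_n | state at i is k).
    for k in states:
        b[n - 1][k] = 1.0
    for i in range(n - 2, -1, -1):
        for k in states:
            b[i][k] = sum(trans[k][l] * emit[l][obs[i + 1]] * b[i + 1][l] for l in states)

    px = sum(f[n - 1][k] for k in states)
    post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
    return post, px


if __name__ == "__main__":
    # Hypothetical fair (F) / loaded (L) coin model.
    states = ("F", "L")
    start = {"F": 0.5, "L": 0.5}
    trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
    emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}
    post, px = forward_backward("HHTH", states, start, trans, emit)
    for i, p in enumerate(post, start=1):
        print(i, p)
```

In a pair HMM the same product of forward and backward values, taken at the match state for cell (i, j), gives the posterior that x_i is aligned to y_j.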

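10 Partition function posterior probabilities Standard alignment score: $S(a) = \sum_{x_i \sim y_j \in a} s(x_i, y_j) + \text{gap penalties}$, where $s(x_i, y_j)$ is the substitution score. Probability of alignment (Miyazawa, Prot. Eng. 1995): $P(a) \propto e^{S(a)/T}$, where T is a temperature parameter. If we knew the alignment partition function $Z(T)$ then $P(a) = e^{S(a)/T} / Z(T)$.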

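11 Partition function posterior probabilities Alignment partition function (Miyazawa, Prot. Eng. 1995): $Z(T) = \sum_{a \in A} e^{S(a)/T}$, the sum over all alignments a of x and y of the exponentiated alignment score S(a) at temperature T. Subsequently, Z(T) can be computed with dynamic programming matrices.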
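To make the definition concrete, the following Python sketch computes the partition function by brute-force enumeration of every alignment of two very short sequences. It is only feasible for tiny inputs and is meant as an illustration: the function names, the scoring values (match 1, mismatch -1, linear gap -1), and the convention that adjacent gaps in opposite sequences count as distinct alignments are assumptions of this sketch.

```python
import math


def enumerate_alignments(x, y, gap="_"):
    """Yield every global alignment of x and y as a pair of gapped strings."""
    if not x and not y:
        yield "", ""
        return
    if x and y:  # align the first residues of x and y
        for ax, ay in enumerate_alignments(x[1:], y[1:], gap):
            yield x[0] + ax, y[0] + ay
    if x:  # first residue of x against a gap
        for ax, ay in enumerate_alignments(x[1:], y, gap):
            yield x[0] + ax, gap + ay
    if y:  # first residue of y against a gap
        for ax, ay in enumerate_alignments(x, y[1:], gap):
            yield gap + ax, y[0] + ay


def alignment_score(ax, ay, match=1.0, mismatch=-1.0, gap_penalty=-1.0, gap="_"):
    """Score S(a) with assumed match/mismatch scores and a linear gap penalty."""
    total = 0.0
    for cx, cy in zip(ax, ay):
        if cx == gap or cy == gap:
            total += gap_penalty
        else:
            total += match if cx == cy else mismatch
    return total


def alignment_probabilities(x, y, T=1.0):
    """Return (Z, {alignment: probability}) with P(a) = exp(S(a)/T) / Z."""
    alignments = list(enumerate_alignments(x, y))
    weights = [math.exp(alignment_score(ax, ay) / T) for ax, ay in alignments]
    Z = sum(weights)
    return Z, {aln: w / Z for aln, w in zip(alignments, weights)}


if __name__ == "__main__":
    Z, probs = alignment_probabilities("AC", "A")
    print("Z =", Z)
    for (ax, ay), p in sorted(probs.items(), key=lambda kv: -kv[1]):
        print(ax, ay, round(p, 4))
```

Lower temperatures concentrate the probability on the highest-scoring alignments, while higher temperatures spread it more evenly across the ensemble.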

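12 Partition function posterior probabilities More generally, the forward partition function matrices are calculated with recursions that mirror the standard affine-gap alignment recursions, with sums in place of maxima and exponentiated scores $e^{s/T}$ in place of additive scores.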
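A minimal Python sketch of a forward partition function matrix follows. To stay short it assumes a linear gap penalty and therefore a single matrix, whereas the affine-gap case uses separate match and gap matrices; the function name `forward_partition`, the `score` callback, and the parameter values are assumptions of this sketch and not Probalign's code.

```python
import math


def forward_partition(x, y, score, gap, T):
    """Forward partition function matrix with a linear gap penalty (assumption).

    Z[i][j] is the sum of exp(S(a)/T) over all alignments a of the prefixes
    x[:i] and y[:j]; Z[len(x)][len(y)] is the full partition function Z(T).
    """
    n, m = len(x), len(y)
    Z = [[0.0] * (m + 1) for _ in range(n + 1)]
    Z[0][0] = 1.0  # the empty alignment has score 0, weight exp(0) = 1
    gap_weight = math.exp(gap / T)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:  # x_i aligned to y_j
                total += Z[i - 1][j - 1] * math.exp(score(x[i - 1], y[j - 1]) / T)
            if i > 0:            # x_i aligned to a gap
                total += Z[i - 1][j] * gap_weight
            if j > 0:            # y_j aligned to a gap
                total += Z[i][j - 1] * gap_weight
            Z[i][j] = total
    return Z


if __name__ == "__main__":
    def simple_score(a, b):
        # Hypothetical substitution scores: +1 match, -1 mismatch.
        return 1.0 if a == b else -1.0

    Z = forward_partition("ACCG", "ACCCA", simple_score, gap=-1.0, T=1.0)
    print("Z(T) =", Z[-1][-1])
```

Compared with the usual optimal-alignment recursion, each `max` over the three predecessor cells becomes a sum, and each added score becomes a multiplication by its exponentiated value.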

13 Partition function matrices vs. standard affine recursions

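14 Posterior probability calculation If we define Z' as the “backward” partition function matrices, computed over suffixes of x and y, then $P(x_i \sim y_j) = \dfrac{Z_{i-1,j-1}\; e^{s(x_i, y_j)/T}\; Z'_{i+1,j+1}}{Z(T)}$, where $s(x_i, y_j)$ is the substitution score and T is the temperature.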
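The posterior calculation can be sketched in Python by combining a forward matrix over prefixes with a backward matrix over suffixes. As a simplifying assumption, this sketch uses a linear gap penalty and obtains the backward matrix by running the forward recursion on the reversed sequences; the function names and scoring values are hypothetical and this is not Probalign's implementation.

```python
import math


def forward_partition(x, y, score, gap, T):
    """Z[i][j] = sum of exp(S(a)/T) over alignments of x[:i] and y[:j]."""
    n, m = len(x), len(y)
    Z = [[0.0] * (m + 1) for _ in range(n + 1)]
    Z[0][0] = 1.0
    gap_weight = math.exp(gap / T)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += Z[i - 1][j - 1] * math.exp(score(x[i - 1], y[j - 1]) / T)
            if i > 0:
                total += Z[i - 1][j] * gap_weight
            if j > 0:
                total += Z[i][j - 1] * gap_weight
            Z[i][j] = total
    return Z


def posterior_matrix(x, y, score, gap, T):
    """Return {(i, j): P(x_i ~ y_j)} for 1-based i, j (linear gaps assumed)."""
    n, m = len(x), len(y)
    Zf = forward_partition(x, y, score, gap, T)
    # Backward matrix via reversal: Zr[a][b] covers the last a residues of x
    # and the last b residues of y, so Z'_{i+1,j+1} = Zr[n - i][m - j].
    Zr = forward_partition(x[::-1], y[::-1], score, gap, T)
    total = Zf[n][m]
    post = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = math.exp(score(x[i - 1], y[j - 1]) / T)
            post[(i, j)] = Zf[i - 1][j - 1] * w * Zr[n - i][m - j] / total
    return post


if __name__ == "__main__":
    def simple_score(a, b):
        # Hypothetical substitution scores: +1 match, -1 mismatch.
        return 1.0 if a == b else -1.0

    post = posterior_matrix("ACCG", "ACCCA", simple_score, gap=-1.0, T=1.0)
    for (i, j), p in sorted(post.items()):
        print(i, j, round(p, 3))
```

Each residue of x is aligned to at most one residue of y in any alignment, so the posteriors in a row sum to at most 1; the remainder is the probability that the residue is aligned to a gap.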

15 Posterior probabilities using alignment ensembles By generating an ensemble A(n,x,y) of n alignments of x and y we can estimate P(x_i ~ y_j) by counting the number of times x_i is aligned to y_j and dividing by n. Note that this means we are assigning equal weights to all alignments in the ensemble.
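The counting estimate can be sketched in a few lines of Python. The helper names `aligned_pairs` and `ensemble_posteriors` are hypothetical, and the two-alignment ensemble in the usage example is purely illustrative.

```python
from collections import Counter


def aligned_pairs(ax, ay, gap="_"):
    """Return the 1-based residue pairs (i, j) aligned in a gapped alignment."""
    pairs = []
    i = j = 0
    for cx, cy in zip(ax, ay):
        if cx != gap:
            i += 1
        if cy != gap:
            j += 1
        if cx != gap and cy != gap:
            pairs.append((i, j))
    return pairs


def ensemble_posteriors(ensemble, gap="_"):
    """Estimate P(x_i ~ y_j) as the fraction of alignments containing the pair.

    Every alignment in the ensemble carries the same weight 1/n.
    """
    counts = Counter()
    for ax, ay in ensemble:
        counts.update(aligned_pairs(ax, ay, gap))
    n = len(ensemble)
    return {pair: c / n for pair, c in counts.items()}


if __name__ == "__main__":
    # Illustrative ensemble of two alignments of ACCG and ACCCA.
    ensemble = [("AC_CG", "ACCCA"), ("ACC_G", "ACCCA")]
    for pair, p in sorted(ensemble_posteriors(ensemble).items()):
        print(pair, p)
```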

16 Generating ensemble of alignments We can use stochastic backtracking (Muckstein et al., Bioinformatics, 2002) to generate a given number of optimal and suboptimal alignments. At every step in the traceback we assign a probability to each of the three possible positions. This allows us to “sample” alignments from their partition function probability distribution. Posterior probabilities estimated this way turn out to be the same as those calculated using forward and backward partition function matrices.
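Stochastic backtracking can be sketched in Python as follows. This is a simplified illustration with a linear gap penalty, not the procedure of Muckstein et al. as published: the function names, the scoring values, and the use of a seeded `random.Random` instance are assumptions of this sketch. At each cell the three predecessor moves are chosen with probability proportional to their contribution to the forward partition function.

```python
import math
import random


def forward_partition(x, y, score, gap, T):
    """Z[i][j] = sum of exp(S(a)/T) over alignments of x[:i] and y[:j]."""
    n, m = len(x), len(y)
    Z = [[0.0] * (m + 1) for _ in range(n + 1)]
    Z[0][0] = 1.0
    gap_weight = math.exp(gap / T)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += Z[i - 1][j - 1] * math.exp(score(x[i - 1], y[j - 1]) / T)
            if i > 0:
                total += Z[i - 1][j] * gap_weight
            if j > 0:
                total += Z[i][j - 1] * gap_weight
            Z[i][j] = total
    return Z


def sample_alignment(x, y, score, gap, T, rng, gap_char="_"):
    """Sample one alignment with probability exp(S(a)/T) / Z(T)."""
    Z = forward_partition(x, y, score, gap, T)
    gap_weight = math.exp(gap / T)
    i, j = len(x), len(y)
    ax, ay = [], []
    while i > 0 or j > 0:
        # Weights of the three possible predecessor moves at cell (i, j).
        w_match = (Z[i - 1][j - 1] * math.exp(score(x[i - 1], y[j - 1]) / T)
                   if i > 0 and j > 0 else 0.0)
        w_up = Z[i - 1][j] * gap_weight if i > 0 else 0.0
        w_left = Z[i][j - 1] * gap_weight if j > 0 else 0.0
        r = rng.random() * (w_match + w_up + w_left)
        if r < w_match:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif r < w_match + w_up or j == 0:
            ax.append(x[i - 1]); ay.append(gap_char); i -= 1
        else:
            ax.append(gap_char); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    def simple_score(a, b):
        # Hypothetical substitution scores: +1 match, -1 mismatch.
        return 1.0 if a == b else -1.0

    rng = random.Random(0)
    for _ in range(3):
        print(sample_alignment("ACCG", "ACCCA", simple_score, -1.0, 1.0, rng))
```

Counting aligned pairs over many such samples gives an estimate that approaches the posterior computed directly from the partition function matrices as the ensemble grows.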

17 Probalign
1. For each pair of sequences (x, y) in the input set:
–a. Compute the partition function matrices Z(T).
–b. Estimate the posterior probability matrix P(x_i ~ y_j) for (x, y) from the forward and backward partition function matrices.
2. Perform the probabilistic consistency transformation and compute a maximal expected accuracy multiple alignment: align sequence profiles along a guide tree and follow with iterative refinement (Do et al.).

18 Multiple protein alignment Protein sequence alignment is a hard problem for multiple distantly related proteins. Several standard protein alignment benchmarks are available: BAliBASE, HOMSTRAD, OXBENCH, PREFAB, and SABMARK. Benchmark alignments are based on manual and computational structural alignment of proteins with known structure.

19 Measure of accuracy Sum-of-pairs score: number of correctly aligned pairs divided by number of pairs in true alignment. Column score: number of correctly aligned columns. Statistical significance using Friedman rank test.
Example: comparing the alignments
AACAGT
AAGT__
and
AACAGT
AA__GT
2 of the 4 aligned pairs are correct. Acc: 2/4 = 50%
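The sum-of-pairs score for a pairwise alignment can be sketched in Python. The function names are hypothetical, and which of the two example alignments serves as the reference is an assumption of this sketch (the score is 2/4 either way, since both contain four aligned pairs).

```python
def aligned_pairs(ax, ay, gap="_"):
    """Return the set of 1-based residue pairs (i, j) aligned to each other."""
    pairs = set()
    i = j = 0
    for cx, cy in zip(ax, ay):
        if cx != gap:
            i += 1
        if cy != gap:
            j += 1
        if cx != gap and cy != gap:
            pairs.add((i, j))
    return pairs


def sum_of_pairs_score(test, reference, gap="_"):
    """Correctly aligned pairs divided by the number of pairs in the reference."""
    test_pairs = aligned_pairs(*test, gap=gap)
    ref_pairs = aligned_pairs(*reference, gap=gap)
    return len(test_pairs & ref_pairs) / len(ref_pairs)


if __name__ == "__main__":
    test = ("AACAGT", "AAGT__")
    reference = ("AACAGT", "AA__GT")  # assumed to be the true alignment
    print(sum_of_pairs_score(test, reference))
```

For multiple alignments the same count is taken over every pair of sequences in the alignment.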

20 Experimental design
Methods compared:
– Probalign
– PROBCONS
– MUSCLE
– MAFFT
Probalign temperature parameter trained on the RV11 subset of BAliBASE 3.0. Default (optimized) parameters for the remaining programs.

21 BAliBASE 3.0

Sum-of-pairs and column score accuracies

Data   Probalign     MAFFT         Probcons      MUSCLE
RV11   69.3 / 45.3   67.1 / 44.6   67.0 / 41.7   59.3 / 35.9
RV12   94.6 / 86.2   93.6 / 83.8   94.1 / 85.5   91.7 / 80.4
RV20   92.6 / 43.9   92.7 / 45.3   91.7 / 40.6   89.2 / 35.1
RV30   85.2 / 56.4   85.6 / 56.9   84.5 / 54.4   80.3 / 38.3
RV40   92.2 / 60.3   92.0 / 59.7   90.3 / 53.2   86.7 / 47.1
RV50   89.3 / 55.2   90.0 / 56.2   89.4 / 57.3   85.7 / 48.7
All    87.6 / 58.9   87.1 / 58.6   86.4 / 55.8   82.5 / 48.5

Friedman rank test P-values

Method RV11 RV12 RV20 RV30 RV40 RV50 All
MAFFT NS < 0.005 NS < 0.005 NS < 0.005
Probcons 0.049 0.0233 NS < 0.005 NS < 0.005
MUSCLE < 0.005 0.008 < 0.005 NS < 0.005

22 Heterogeneous length data I

BAliBASE datasets with maximum length and minimum deviation

Max length / Standard dev.   Probalign     MAFFT         Probcons      MUSCLE
500 / 100                    88.4 / 56.6   88.0 / 58.0   86.7 / 51.6   81.5 / 42.5
500 / 200                    88.5 / 54.6   87.0 / 51.9   87.2 / 48.9   81.9 / 42.4
1000 / 100                   91.4 / 58.1   90.4 / 55.7   89.7 / 51.6   84.3 / 44.1
1000 / 200                   90.7 / 55.0   89.3 / 51.4   89.2 / 48.7   83.2 / 42.5

BAliBASE datasets with long extensions (RV40)

Max length / Standard dev.   Probalign     MAFFT         Probcons
1000 / 100 (25)              92.7 / 59.3   91.0 / 54.8   89.9 / 48.2
1000 / 200 (20)              93.0 / 57.3   90.8 / 52.1   90.6 / 47.6

23 Heterogeneous length data II

BAliBASE 2.0 reference 6 datasets with max length and minimum deviation

Max length / Standard dev.   Probalign     MAFFT         Probcons
500 / 100 (40)               89.1 / 44.9   87.3 / 49.0   87.4 / 38.6
500 / 200 (21)               88.3 / 43.8   85.0 / 46.4   86.7 / 40.0
500 / 300 (9)                95.3 / 61.0   82.6 / 51.3   87.3 / 46.6
500 / 400 (5)                94.6 / 55.0   72.0 / 38.2   79.8 / 38.0
1000 / 100 (15)              90.2 / 43.3   82.4 / 36.9   85.4 / 27.6
1000 / 200 (12)              89.2 / 38.2   79.7 / 32.4   83.6 / 27.7
1000 / 300 (7)               94.5 / 52.8   78.3 / 42.4   83.9 / 34.6
1000 / 400 (5)               94.6 / 55.0   72.0 / 38.2   79.8 / 38.0

