1 Protein Multiple Alignment by Konstantin Davydov
2 Papers
MUSCLE: a multiple sequence alignment method with reduced time and space complexity, by Robert C. Edgar
ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment, by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou
3 Outline Introduction Background MUSCLE ProbCons Conclusion
4 Introduction What is multiple protein alignment? Given N sequences of amino acids x_1, x_2, …, x_N, insert gaps in each of the x_i so that all sequences have the same length and the score of the global map is maximal. Example: ACCTGCA and ACTTCAA become ACCTGCA-- and AC--TTCAA.
5 Introduction Motivation Phylogenetic tree estimation Secondary structure prediction Identification of critical regions
6 Outline Introduction Background MUSCLE ProbCons Conclusion
7 Background Aligning two sequences: ACCTGCA and ACTTCAA become ACCTGCA-- and AC--TTCAA.
8 Background
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
(figure: sequences labeled x, y, z in an alignment lattice)
9 Background Unfortunately, this can get very expensive. Aligning N sequences of length L requires a matrix of size L^N, where each cell in the matrix has 2^N - 1 neighbors. This gives a total time complexity of O(2^N L^N).
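For the N = 2 base case the table is only (L+1) × (L+1), which is why pairwise alignment stays cheap. A minimal sketch of pairwise global alignment scoring (Needleman-Wunsch); the match/mismatch/gap values are an illustrative assumption:

```python
# Pairwise global alignment score via Needleman-Wunsch dynamic
# programming: an O(L^2) table for N = 2 sequences.
# Scoring scheme (match=1, mismatch=-1, gap=-2) is illustrative.

def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Fill a (len(x)+1) x (len(y)+1) DP table; return the optimal score."""
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # x aligned entirely against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # y aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # align x_i with y_j
                           dp[i - 1][j] + gap,      # gap in y
                           dp[i][j - 1] + gap)      # gap in x
    return dp[n][m]

print(nw_score("ACCTGCA", "ACTTCAA"))  # the slide's example pair
```

Generalizing this table to N sequences gives the L^N lattice with 2^N - 1 predecessors per cell described above.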
10 Outline Introduction Background MUSCLE ProbCons Conclusion
11 MUSCLE
12 MUSCLE Basic strategy: a progressive alignment is built, to which horizontal refinement is applied. There are three stages; at the end of each stage a multiple alignment is available, and the algorithm can be terminated.
13 Three Stages Draft Progressive Improved Progressive Refinement
14 Stage 1: Draft Progressive Similarity measure – calculated using k-mer counting. Example: in ACCATGCGAATGGTCCACAATG, the k-mer ATG occurs 3 times and CCA occurs 2 times.
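The k-mer counting above can be sketched as follows; the similarity definition here (shared k-mers over the smaller k-mer total) is an illustrative assumption, not MUSCLE's exact formula:

```python
# Counting k-mers and comparing two sequences by the k-mers they share,
# the idea behind MUSCLE's first (alignment-free) similarity measure.
from collections import Counter

def kmer_counts(seq, k):
    """Multiset of all length-k substrings of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(x, y, k=3):
    """Shared k-mer count over the smaller total (illustrative definition)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(cx[w], cy[w]) for w in cx)
    return shared / min(sum(cx.values()), sum(cy.values()))

counts = kmer_counts("ACCATGCGAATGGTCCACAATG", 3)
print(counts["ATG"], counts["CCA"])  # prints: 3 2, as on the slide
```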
15 Stage 1: Draft Progressive Distance estimate – based on the similarities, construct a triangular distance matrix:
     X1   X2   X3
X2  0.6
X3  0.8  0.2
X4  0.3  0.7  0.5
16 Stage 1: Draft Progressive Tree construction – from the distance matrix we construct a tree by clustering. (figure: sequences X1, X2, X3, X4 joined into a binary guide tree)
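Tree construction from the distance matrix can be sketched with UPGMA-style average-linkage clustering (one of the clustering methods MUSCLE can use); the nested-tuple tree representation is an illustrative choice, and the example distances are taken from the slide's matrix:

```python
# UPGMA-style guide-tree construction: repeatedly merge the two closest
# clusters, replacing their distances by a size-weighted average.

def upgma(d, labels):
    """d: {frozenset({a, b}): distance}. Returns a nested-tuple guide tree."""
    clusters = list(labels)
    size = {c: 1 for c in clusters}
    d = dict(d)
    while len(clusters) > 1:
        # pick the closest pair of current clusters
        a, b = min(((x, y) for i, x in enumerate(clusters)
                    for y in clusters[i + 1:]),
                   key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size[merged] = size[a] + size[b]
        for c in clusters:
            if c not in (a, b):
                # size-weighted average of the distances to the two halves
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]) / size[merged]
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(merged)
    return clusters[0]

# Distances from the slide's triangular matrix
dist = {frozenset(p): v for p, v in {
    ("X1", "X2"): 0.6, ("X1", "X3"): 0.8, ("X2", "X3"): 0.2,
    ("X1", "X4"): 0.3, ("X2", "X4"): 0.7, ("X3", "X4"): 0.5}.items()}
tree = upgma(dist, ["X1", "X2", "X3", "X4"])
print(tree)
```

With these distances, X2 and X3 (distance 0.2) merge first, then X1 and X4 (0.3), and finally the two pairs join at the root.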
17 Stage 1: Draft Progressive
18 Stage 1: Draft Progressive Progressive alignment – a progressive alignment is built by following the branching order of the tree. This yields a multiple alignment of all input sequences at the root. (figure: profiles merged up the guide tree into an alignment of X1, X2, X3, X4)
19 Stage 2: Improved Progressive Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated. (figure: distance matrix and guide tree over X1, X2, X3, X4)
20 Stage 2: Improved Progressive Similarity measure – similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment:
TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC
Pair extracted: TCC--AA and TCA--AA.
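Fractional identity from two rows of the current alignment can be computed as below; conventions differ on whether gap columns enter the denominator, so counting only residue-residue columns here is an assumption:

```python
# Fractional identity of two sequences, read directly from their rows
# in the current multiple alignment: identical residue pairs divided by
# the number of columns where both rows carry a residue.

def fractional_identity(row_x, row_y):
    pairs = [(a, b) for a, b in zip(row_x, row_y)
             if a != '-' and b != '-']
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

print(fractional_identity("TCC--AA", "TCA--AA"))  # prints 0.8 (4 of 5 match)
```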
21 Stage 2: Improved Progressive Tree construction – a tree is constructed by computing a Kimura distance matrix and applying a clustering method to it. (figure: distance matrix and resulting guide tree over X1, X2, X3, X4)
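The Kimura correction converts an observed fractional difference D = 1 - (fractional identity) into an estimated evolutionary distance; the formula below is the standard protein-distance correction d = -ln(1 - D - D²/5):

```python
import math

# Kimura-corrected protein distance from observed fractional identity.
# The correction compensates for multiple substitutions at one site.

def kimura_distance(fractional_identity):
    D = 1.0 - fractional_identity            # observed fraction of differences
    return -math.log(1.0 - D - D * D / 5.0)  # valid while the argument is > 0

print(round(kimura_distance(0.8), 3))  # prints 0.233
```

Note the corrected distance (0.233) exceeds the raw difference (0.2), as the correction intends.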
22 Stage 2: Improved Progressive Tree comparison – The new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed
23 Stage 2: Improved Progressive Progressive alignment – a new progressive alignment is built. (figure: new guide tree over X1, X2, X3, X4 yielding a new alignment)
24 Stage 3: Refinement Performs iterative refinement
25 Stage 3: Refinement Choice of bipartition – an edge is removed from the tree, dividing the sequences into two disjoint subsets. (figure: deleting an edge splits the tree over X1 through X5 into two subsets)
26 Stage 3: Refinement Profile extraction – the multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed:
Full alignment: TCC--AA / TCA--GA / TCA--AA / G--ATAC / T--CTGC
Subset 1: TCC--AA, TCA--AA → TCCAA, TCAAA (all-gap columns removed)
Subset 2: TCA--GA, T--CTGC, G--ATAC
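Profile extraction is just row selection plus removal of all-gap columns, as in the slide's subset:

```python
# Extract a subset's profile from the current multiple alignment:
# keep the chosen rows, then drop every column that is a gap in all
# of the kept rows.

def extract_profile(rows):
    keep = [i for i in range(len(rows[0]))
            if any(r[i] != '-' for r in rows)]
    return ["".join(r[i] for i in keep) for r in rows]

print(extract_profile(["TCC--AA", "TCA--AA"]))  # prints ['TCCAA', 'TCAAA']
```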
27 Stage 3: Refinement Re-alignment – the two profiles are then realigned with each other using profile-profile alignment:
Profiles: TCCAA, TCAAA and TCA--GA, T--CTGC, G--ATAC
Result: T--CCAA / T--CAAA / TCA--GA / T--CTGC / G--ATAC
28 Stage 3: Refinement Accept/Reject – the score of the new alignment is computed; if it is higher than the old alignment's score, the new alignment is retained, otherwise it is discarded.
New: T--CCAA / T--CAAA / TCA--GA / T--CTGC / G--ATAC
Old: TCC--AA / TCA--GA / TCA--AA / G--ATAC / T--CTGC
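A sketch of the accept/reject step, with a simple sum-of-pairs score standing in for MUSCLE's actual objective function (the scoring values are illustrative):

```python
# Accept/reject: score both alignments and keep the higher-scoring one.
# sp_score is a toy sum-of-pairs column score, not MUSCLE's objective.

def sp_score(rows, match=1, mismatch=-1, gap=-1):
    total = 0
    for col in zip(*rows):                 # iterate over alignment columns
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                a, b = col[i], col[j]
                if a == '-' and b == '-':
                    continue               # gap-gap pairs score nothing
                elif a == '-' or b == '-':
                    total += gap
                else:
                    total += match if a == b else mismatch
    return total

def accept_or_reject(old_rows, new_rows):
    return new_rows if sp_score(new_rows) > sp_score(old_rows) else old_rows

old = ["TCC--AA", "TCA--GA", "TCA--AA", "G--ATAC", "T--CTGC"]
new = ["T--CCAA", "T--CAAA", "TCA--GA", "T--CTGC", "G--ATAC"]
kept = accept_or_reject(old, new)
```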
29 MUSCLE Review Performance – for alignment of N sequences of length L: space complexity O(N^2 + L^2); time complexity O(N^4 + NL^2); time complexity without refinement O(N^3 + NL^2).
30 Outline Introduction Background MUSCLE ProbCons Conclusion
31 Hidden Markov Models (HMMs) (figure: a pair-HMM with a match state M and insert states I and J; the alignment AGCC-AGC / -GCCCAGT corresponds to the state path IMMMJMMM)
32 Pairwise Alignment Viterbi algorithm – picks the alignment that is most likely to be the optimal alignment. However, the most likely alignment is not necessarily the most accurate. Alternative: find the alignment of maximum expected accuracy.
33 Lazy Teacher Analogy 10 students take a 10-question true/false quiz. How do you make the answer key? Viterbi approach: use the answer sheet of the best student. MEA approach: take a weighted majority vote. (figure: the students' answers to question 4, weighted by their letter grades)
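The two approaches in the analogy, in code: Viterbi trusts the single best-weighted sheet, MEA takes a per-question weighted vote. The sheets, weights, and grade-to-weight mapping are illustrative assumptions:

```python
# Lazy-teacher analogy: build an answer key from weighted student sheets.

def viterbi_key(answer_sheets, weights):
    """Use the sheet of the single best-weighted student."""
    return max(zip(answer_sheets, weights), key=lambda p: p[1])[0]

def mea_key(answer_sheets, weights):
    """Per-question weighted majority vote across all students."""
    key = []
    for answers in zip(*answer_sheets):
        t = sum(w for a, w in zip(answers, weights) if a == 'T')
        f = sum(w for a, w in zip(answers, weights) if a == 'F')
        key.append('T' if t >= f else 'F')
    return "".join(key)

sheets = ["TTFF", "TFTT", "TFTT"]   # three students, four questions
weights = [4.0, 3.0, 3.0]           # e.g. grades A, B, B as points
print(viterbi_key(sheets, weights), mea_key(sheets, weights))
```

Here the best student alone gives TTFF, while the weighted majority gives TFTT: the two criteria can disagree, just as Viterbi and MEA alignments can.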
34 Viterbi vs MEA Viterbi – Picks the alignment with the highest chance of being completely correct Maximum Expected Accuracy – Picks the alignment with the highest expected number of correct predictions
35 ProbCons Basic Strategy: Uses Hidden Markov Models (HMM) to predict the probability of an alignment. Uses Maximum Expected Accuracy instead of the Viterbi alignment. 5 steps
36 Notation Given N sequences S = {s_1, s_2, …, s_N}; a* is the optimal alignment.
37 ProbCons Step 1: Computation of posterior- probability matrices Step 2: Computation of expected accuracies Step 3: Probabilistic consistency transformation Step 4: Computation of guide tree Step 5: Progressive alignment Post-processing step: Iterative refinement
38 Step 1: Computation of posterior-probability matrices For every pair of sequences x, y ∈ S, compute the matrix P_xy with P_xy(i, j) = P(x_i ~ y_j ∈ a* | x, y), the probability that x_i and y_j are paired in a*.
39 Step 2: Computation of expected accuracies For a pairwise alignment a between x and y, define the accuracy as: accuracy(a, a*) = (number of correctly predicted matches) / (length of the shorter sequence).
40 Step 2: Computation of expected accuracies (continued) The MEA alignment is found by finding the highest-summing path through the matrix M_xy[i, j] = P(x_i is aligned to y_j | x, y).
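The highest-summing path through M_xy is a Needleman-Wunsch-style dynamic program where a match contributes M[i][j] and gaps contribute 0; the toy posterior matrix is illustrative:

```python
# Maximum-expected-accuracy score: the best sum of matched posteriors
# along a monotone path through the posterior matrix M (gaps score 0).

def mea_score(M):
    n, m = len(M), len(M[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + M[i - 1][j - 1],  # match i, j
                           dp[i - 1][j],                        # skip x_i
                           dp[i][j - 1])                        # skip y_j
    return dp[n][m]

M = [[0.9, 0.1],   # toy posterior matrix for two 2-residue sequences
     [0.2, 0.7]]
print(mea_score(M))
```

On this matrix the best path matches both diagonal cells, for an expected 0.9 + 0.7 correct pairs.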
41 Consistency (figure: a match x_i ~ y_j supported through a third sequence z, via x_i ~ z_k and z_k ~ y_j')
42 Step 3: Probabilistic consistency transformation Re-estimate the match quality scores P(x_i ~ y_j ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates the similarity of x and y to other sequences z from S into the x-y comparison: P(x_i ~ y_j ∈ a* | x, y) → P(x_i ~ y_j ∈ a* | x, y, z).
43 Step 3: Probabilistic consistency transformation (continued) In matrix form, the transformed posteriors are P'_xy = (1/|S|) Σ_{z ∈ S} P_xz P_zy, where P_zz is taken to be the identity matrix.
44 Step 3: Probabilistic consistency transformation (continued) Since most of the values of P_xz and P_zy are very small, all entries smaller than some threshold w are ignored, which allows sparse matrix multiplication. The transformation may be repeated.
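One round of the transformation for a pair (x, y), written as thresholded matrix products P'_xy = (1/|S|) Σ_z P_xz P_zy with P_zz = I; dense lists are used for clarity, and the threshold value w is illustrative:

```python
# Probabilistic consistency transformation for one sequence pair,
# skipping entries below threshold w (the "sparse" shortcut).

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def relaxed_posterior(P, x, y, seqs, lengths, w=0.01):
    """P'_xy(i, j) = (1/|S|) * sum over z, k of P_xz(i, k) * P_zy(k, j)."""
    nx, ny = lengths[x], lengths[y]
    out = [[0.0] * ny for _ in range(nx)]
    for z in seqs:
        Pxz = identity(nx) if z == x else P[(x, z)]
        Pzy = identity(ny) if z == y else P[(z, y)]
        for i in range(nx):
            for k in range(lengths[z]):
                if Pxz[i][k] < w:          # ignore negligible entries
                    continue
                for j in range(ny):
                    out[i][j] += Pxz[i][k] * Pzy[k][j]
    return [[v / len(seqs) for v in row] for row in out]

# With only two sequences, the transform should leave P_xy unchanged.
P = {("x", "y"): [[0.8, 0.1], [0.1, 0.8]],
     ("y", "x"): [[0.8, 0.1], [0.1, 0.8]]}
Pp = relaxed_posterior(P, "x", "y", ["x", "y"], {"x": 2, "y": 2})
```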
45 Step 4: Computation of guide tree Use E(x, y) as a measure of similarity. Define the similarity of two clusters by the sum-of-pairs of their members' similarities.
46 Step 5: Progressive alignment Align sequence groups hierarchically, following the order specified in the guide tree. Alignments are scored using a sum-of-pairs scoring function: aligned residues are scored according to the match quality scores P(x_i ~ y_j ∈ a* | x, y), and gap penalties are set to 0.
47 Post-processing step: iterative refinement Much like in MUSCLE: randomly partition the alignment into two groups of sequences and realign. May be repeated.
48 ProbCons overview ProbCons demonstrated dramatic improvements in alignment accuracy, at the cost of a longer running time. It doesn't use protein-specific alignment information, so it can also be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm.
49 Outline Introduction Background MUSCLE ProbCons Conclusion
50 Conclusion MUSCLE demonstrated poor accuracy but a very short running time. ProbCons demonstrated dramatic improvements in alignment accuracy; however, it is much slower than MUSCLE.
51 Results
52 Reliability Scores
53 Questions?
54 References Robert C. Edgar – MUSCLE: a multiple sequence alignment method with reduced time and space complexity. Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou – ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment.
55 References (continued) Slides on Multiple Sequence Alignment, CS262 Slides on Sequence similarity, CS273 Slides on Protein Multiple Alignment, Marina Sirota