1 Protein Multiple Alignment by Konstantin Davydov
2 Papers
MUSCLE: a multiple sequence alignment method with reduced time and space complexity, by Robert C. Edgar
ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment, by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou
3 Outline Introduction Background MUSCLE ProbCons Conclusion
4 Introduction What is multiple protein alignment? Given N sequences of amino acids x_1, x_2, …, x_N, insert gaps in each of the x_i so that all sequences have the same length and the score of the global map is maximal. Example: ACCTGCA and ACTTCAA become ACCTGCA-- and AC--TTCAA.
5 Introduction Motivation Phylogenetic tree estimation Secondary structure prediction Identification of critical regions
6 Outline Introduction Background MUSCLE ProbCons Conclusion
7 Background Aligning two sequences: ACCTGCA and ACTTCAA become ACCTGCA-- and AC--TTCAA.
8 Background
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
(figure: sequences labeled x, y, z in an alignment lattice)
9 Background Unfortunately, this can get very expensive. Aligning N sequences of length L requires a matrix of size L^N, where each cell in the matrix has 2^N - 1 neighbors. This gives a total time complexity of O(2^N L^N).
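For the N = 2 base case the table is only (L+1) × (L+1), which is why pairwise alignment stays cheap. A minimal sketch of pairwise global alignment scoring (Needleman-Wunsch); the match/mismatch/gap values are an illustrative assumption:

```python
# Pairwise global alignment score via Needleman-Wunsch dynamic
# programming: an O(L^2) table for N = 2 sequences.
# Scoring scheme (match=1, mismatch=-1, gap=-2) is illustrative.

def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Fill a (len(x)+1) x (len(y)+1) DP table; return the optimal score."""
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # x aligned entirely against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # y aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # align x_i with y_j
                           dp[i - 1][j] + gap,      # gap in y
                           dp[i][j - 1] + gap)      # gap in x
    return dp[n][m]

print(nw_score("ACCTGCA", "ACTTCAA"))  # the slide's example pair
```

Generalizing this table to N sequences gives the L^N lattice with 2^N - 1 predecessors per cell described above.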
10 Outline Introduction Background MUSCLE ProbCons Conclusion
11 MUSCLE
12 MUSCLE Basic strategy: a progressive alignment is built, to which horizontal refinement is applied. There are three stages; at the end of each stage a multiple alignment is available, and the algorithm can be terminated.
13 Three Stages Draft Progressive Improved Progressive Refinement
14 Stage 1: Draft Progressive Similarity measure – calculated using k-mer counting. Example: in ACCATGCGAATGGTCCACAATG, the k-mer ATG occurs 3 times and CCA occurs 2 times.
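The k-mer counting above can be sketched as follows; the similarity definition here (shared k-mers over the smaller k-mer total) is an illustrative assumption, not MUSCLE's exact formula:

```python
# Counting k-mers and comparing two sequences by the k-mers they share,
# the idea behind MUSCLE's first (alignment-free) similarity measure.
from collections import Counter

def kmer_counts(seq, k):
    """Multiset of all length-k substrings of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(x, y, k=3):
    """Shared k-mer count over the smaller total (illustrative definition)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(cx[w], cy[w]) for w in cx)
    return shared / min(sum(cx.values()), sum(cy.values()))

counts = kmer_counts("ACCATGCGAATGGTCCACAATG", 3)
print(counts["ATG"], counts["CCA"])  # prints: 3 2, as on the slide
```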
15 Stage 1: Draft Progressive Distance estimate – based on the similarities, construct a triangular distance matrix:
     X1   X2   X3
X2  0.6
X3  0.8  0.2
X4  0.3  0.7  0.5
16 Stage 1: Draft Progressive Tree construction – from the distance matrix we construct a tree by clustering. (figure: sequences X1, X2, X3, X4 joined into a binary guide tree)
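Tree construction from the distance matrix can be sketched with UPGMA-style average-linkage clustering (one of the clustering methods MUSCLE can use); the nested-tuple tree representation is an illustrative choice, and the example distances are taken from the slide's matrix:

```python
# UPGMA-style guide-tree construction: repeatedly merge the two closest
# clusters, replacing their distances by a size-weighted average.

def upgma(d, labels):
    """d: {frozenset({a, b}): distance}. Returns a nested-tuple guide tree."""
    clusters = list(labels)
    size = {c: 1 for c in clusters}
    d = dict(d)
    while len(clusters) > 1:
        # pick the closest pair of current clusters
        a, b = min(((x, y) for i, x in enumerate(clusters)
                    for y in clusters[i + 1:]),
                   key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size[merged] = size[a] + size[b]
        for c in clusters:
            if c not in (a, b):
                # size-weighted average of the distances to the two halves
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]) / size[merged]
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(merged)
    return clusters[0]

# Distances from the slide's triangular matrix
dist = {frozenset(p): v for p, v in {
    ("X1", "X2"): 0.6, ("X1", "X3"): 0.8, ("X2", "X3"): 0.2,
    ("X1", "X4"): 0.3, ("X2", "X4"): 0.7, ("X3", "X4"): 0.5}.items()}
tree = upgma(dist, ["X1", "X2", "X3", "X4"])
print(tree)
```

With these distances, X2 and X3 (distance 0.2) merge first, then X1 and X4 (0.3), and finally the two pairs join at the root.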
17 Stage 1: Draft Progressive
18 Stage 1: Draft Progressive Progressive alignment – a progressive alignment is built by following the branching order of the tree. This yields a multiple alignment of all input sequences at the root. (figure: profiles merged up the guide tree into an alignment of X1, X2, X3, X4)
19 Stage 2: Improved Progressive Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated. (figure: distance matrix and guide tree over X1, X2, X3, X4)
20 Stage 2: Improved Progressive Similarity measure – similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment:
TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC
Pair extracted: TCC--AA and TCA--AA.
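Fractional identity from two rows of the current alignment can be computed as below; conventions differ on whether gap columns enter the denominator, so counting only residue-residue columns here is an assumption:

```python
# Fractional identity of two sequences, read directly from their rows
# in the current multiple alignment: identical residue pairs divided by
# the number of columns where both rows carry a residue.

def fractional_identity(row_x, row_y):
    pairs = [(a, b) for a, b in zip(row_x, row_y)
             if a != '-' and b != '-']
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

print(fractional_identity("TCC--AA", "TCA--AA"))  # prints 0.8 (4 of 5 match)
```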
21 Stage 2: Improved Progressive Tree construction – a tree is constructed by computing a Kimura distance matrix and applying a clustering method to it. (figure: distance matrix and resulting guide tree over X1, X2, X3, X4)
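The Kimura correction converts an observed fractional difference D = 1 - (fractional identity) into an estimated evolutionary distance; the formula below is the standard protein-distance correction d = -ln(1 - D - D²/5):

```python
import math

# Kimura-corrected protein distance from observed fractional identity.
# The correction compensates for multiple substitutions at one site.

def kimura_distance(fractional_identity):
    D = 1.0 - fractional_identity            # observed fraction of differences
    return -math.log(1.0 - D - D * D / 5.0)  # valid while the argument is > 0

print(round(kimura_distance(0.8), 3))  # prints 0.233
```

Note the corrected distance (0.233) exceeds the raw difference (0.2), as the correction intends.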
22 Stage 2: Improved Progressive Tree comparison – The new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed
23 Stage 2: Improved Progressive Progressive alignment – a new progressive alignment is built. (figure: new guide tree over X1, X2, X3, X4 yielding a new alignment)
24 Stage 3: Refinement Performs iterative refinement
25 Stage 3: Refinement Choice of bipartition – an edge is removed from the tree, dividing the sequences into two disjoint subsets. (figure: deleting an edge splits the tree over X1 through X5 into two subsets)
26 Stage 3: Refinement Profile extraction – the multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed:
Full alignment: TCC--AA / TCA--GA / TCA--AA / G--ATAC / T--CTGC
Subset 1: TCC--AA, TCA--AA → TCCAA, TCAAA (all-gap columns removed)
Subset 2: TCA--GA, T--CTGC, G--ATAC
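Profile extraction is just row selection plus removal of all-gap columns, as in the slide's subset:

```python
# Extract a subset's profile from the current multiple alignment:
# keep the chosen rows, then drop every column that is a gap in all
# of the kept rows.

def extract_profile(rows):
    keep = [i for i in range(len(rows[0]))
            if any(r[i] != '-' for r in rows)]
    return ["".join(r[i] for i in keep) for r in rows]

print(extract_profile(["TCC--AA", "TCA--AA"]))  # prints ['TCCAA', 'TCAAA']
```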
27 Stage 3: Refinement Re-alignment – the two profiles are then realigned with each other using profile-profile alignment:
Profiles: TCCAA, TCAAA and TCA--GA, T--CTGC, G--ATAC
Result: T--CCAA / T--CAAA / TCA--GA / T--CTGC / G--ATAC
28 Stage 3: Refinement Accept/Reject – the score of the new alignment is computed; if it is higher than the old alignment's score, the new alignment is retained, otherwise it is discarded.
New: T--CCAA / T--CAAA / TCA--GA / T--CTGC / G--ATAC
Old: TCC--AA / TCA--GA / TCA--AA / G--ATAC / T--CTGC
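A sketch of the accept/reject step, with a simple sum-of-pairs score standing in for MUSCLE's actual objective function (the scoring values are illustrative):

```python
# Accept/reject: score both alignments and keep the higher-scoring one.
# sp_score is a toy sum-of-pairs column score, not MUSCLE's objective.

def sp_score(rows, match=1, mismatch=-1, gap=-1):
    total = 0
    for col in zip(*rows):                 # iterate over alignment columns
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                a, b = col[i], col[j]
                if a == '-' and b == '-':
                    continue               # gap-gap pairs score nothing
                elif a == '-' or b == '-':
                    total += gap
                else:
                    total += match if a == b else mismatch
    return total

def accept_or_reject(old_rows, new_rows):
    return new_rows if sp_score(new_rows) > sp_score(old_rows) else old_rows

old = ["TCC--AA", "TCA--GA", "TCA--AA", "G--ATAC", "T--CTGC"]
new = ["T--CCAA", "T--CAAA", "TCA--GA", "T--CTGC", "G--ATAC"]
kept = accept_or_reject(old, new)
```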
29 MUSCLE Review Performance – for alignment of N sequences of length L: space complexity O(N^2 + L^2); time complexity O(N^4 + NL^2); time complexity without refinement O(N^3 + NL^2).
30 Outline Introduction Background MUSCLE ProbCons Conclusion
31 Hidden Markov Models (HMMs) (figure: a pair-HMM with a match state M and insert states I and J; the alignment AGCC-AGC / -GCCCAGT corresponds to the state path IMMMJMMM)
32 Pairwise Alignment Viterbi algorithm – picks the alignment that is most likely to be the optimal alignment. However, the most likely alignment is not necessarily the most accurate. Alternative: find the alignment of maximum expected accuracy.
33 Lazy Teacher Analogy 10 students take a 10-question true/false quiz. How do you make the answer key? Viterbi approach: use the answer sheet of the best student. MEA approach: take a weighted majority vote. (figure: the students' answers to question 4, weighted by their letter grades)
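The two approaches in the analogy, in code: Viterbi trusts the single best-weighted sheet, MEA takes a per-question weighted vote. The sheets, weights, and grade-to-weight mapping are illustrative assumptions:

```python
# Lazy-teacher analogy: build an answer key from weighted student sheets.

def viterbi_key(answer_sheets, weights):
    """Use the sheet of the single best-weighted student."""
    return max(zip(answer_sheets, weights), key=lambda p: p[1])[0]

def mea_key(answer_sheets, weights):
    """Per-question weighted majority vote across all students."""
    key = []
    for answers in zip(*answer_sheets):
        t = sum(w for a, w in zip(answers, weights) if a == 'T')
        f = sum(w for a, w in zip(answers, weights) if a == 'F')
        key.append('T' if t >= f else 'F')
    return "".join(key)

sheets = ["TTFF", "TFTT", "TFTT"]   # three students, four questions
weights = [4.0, 3.0, 3.0]           # e.g. grades A, B, B as points
print(viterbi_key(sheets, weights), mea_key(sheets, weights))
```

Here the best student alone gives TTFF, while the weighted majority gives TFTT: the two criteria can disagree, just as Viterbi and MEA alignments can.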
34 Viterbi vs MEA Viterbi – Picks the alignment with the highest chance of being completely correct Maximum Expected Accuracy – Picks the alignment with the highest expected number of correct predictions
35 ProbCons Basic Strategy: Uses Hidden Markov Models (HMM) to predict the probability of an alignment. Uses Maximum Expected Accuracy instead of the Viterbi alignment. 5 steps
36 Notation Given N sequences S = {s_1, s_2, …, s_N}; a* is the optimal alignment.
37 ProbCons Step 1: Computation of posterior- probability matrices Step 2: Computation of expected accuracies Step 3: Probabilistic consistency transformation Step 4: Computation of guide tree Step 5: Progressive alignment Post-processing step: Iterative refinement
38 Step 1: Computation of posterior-probability matrices For every pair of sequences x, y ∈ S, compute the matrix P_xy with P_xy(i, j) = P(x_i ~ y_j ∈ a* | x, y), the probability that x_i and y_j are paired in a*.
39 Step 2: Computation of expected accuracies For a pairwise alignment a between x and y, define the accuracy as: accuracy(a, a*) = (number of correctly predicted matches) / (length of the shorter sequence).
40 Step 2: Computation of expected accuracies (continued) The MEA alignment is found by finding the highest-summing path through the matrix M_xy[i, j] = P(x_i is aligned to y_j | x, y).
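The highest-summing path through M_xy is a Needleman-Wunsch-style dynamic program where a match contributes M[i][j] and gaps contribute 0; the toy posterior matrix is illustrative:

```python
# Maximum-expected-accuracy score: the best sum of matched posteriors
# along a monotone path through the posterior matrix M (gaps score 0).

def mea_score(M):
    n, m = len(M), len(M[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + M[i - 1][j - 1],  # match i, j
                           dp[i - 1][j],                        # skip x_i
                           dp[i][j - 1])                        # skip y_j
    return dp[n][m]

M = [[0.9, 0.1],   # toy posterior matrix for two 2-residue sequences
     [0.2, 0.7]]
print(mea_score(M))
```

On this matrix the best path matches both diagonal cells, for an expected 0.9 + 0.7 correct pairs.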
41 Consistency (figure: a match x_i ~ y_j supported through a third sequence z, via x_i ~ z_k and z_k ~ y_j')
42 Step 3: Probabilistic consistency transformation Re-estimate the match quality scores P(x_i ~ y_j ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates the similarity of x and y to other sequences z from S into the x-y comparison: P(x_i ~ y_j ∈ a* | x, y) → P(x_i ~ y_j ∈ a* | x, y, z).
43 Step 3: Probabilistic consistency transformation (continued) In matrix form, the transformed posteriors are P'_xy = (1/|S|) Σ_{z ∈ S} P_xz P_zy, where P_zz is taken to be the identity matrix.
44 Step 3: Probabilistic consistency transformation (continued) Since most of the values of P_xz and P_zy are very small, all entries smaller than some threshold w are ignored, which allows sparse matrix multiplication. The transformation may be repeated.
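One round of the transformation for a pair (x, y), written as thresholded matrix products P'_xy = (1/|S|) Σ_z P_xz P_zy with P_zz = I; dense lists are used for clarity, and the threshold value w is illustrative:

```python
# Probabilistic consistency transformation for one sequence pair,
# skipping entries below threshold w (the "sparse" shortcut).

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def relaxed_posterior(P, x, y, seqs, lengths, w=0.01):
    """P'_xy(i, j) = (1/|S|) * sum over z, k of P_xz(i, k) * P_zy(k, j)."""
    nx, ny = lengths[x], lengths[y]
    out = [[0.0] * ny for _ in range(nx)]
    for z in seqs:
        Pxz = identity(nx) if z == x else P[(x, z)]
        Pzy = identity(ny) if z == y else P[(z, y)]
        for i in range(nx):
            for k in range(lengths[z]):
                if Pxz[i][k] < w:          # ignore negligible entries
                    continue
                for j in range(ny):
                    out[i][j] += Pxz[i][k] * Pzy[k][j]
    return [[v / len(seqs) for v in row] for row in out]

# With only two sequences, the transform should leave P_xy unchanged.
P = {("x", "y"): [[0.8, 0.1], [0.1, 0.8]],
     ("y", "x"): [[0.8, 0.1], [0.1, 0.8]]}
Pp = relaxed_posterior(P, "x", "y", ["x", "y"], {"x": 2, "y": 2})
```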
45 Step 4: Computation of guide tree Use E(x, y) as a measure of similarity. Define the similarity of two clusters by the sum-of-pairs of their members' similarities.
46 Step 5: Progressive alignment Align sequence groups hierarchically, following the order specified in the guide tree. Alignments are scored using a sum-of-pairs scoring function: aligned residues are scored according to the match quality scores P(x_i ~ y_j ∈ a* | x, y), and gap penalties are set to 0.
47 Post-processing step: iterative refinement Much like in MUSCLE: randomly partition the alignment into two groups of sequences and realign. May be repeated.
48 ProbCons overview ProbCons demonstrated dramatic improvements in alignment accuracy, at the cost of a longer running time. It doesn't use protein-specific alignment information, so it can also be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm.
49 Outline Introduction Background MUSCLE ProbCons Conclusion
50 Conclusion MUSCLE demonstrated poor accuracy but a very short running time. ProbCons demonstrated dramatic improvements in alignment accuracy; however, it is much slower than MUSCLE.
51 Results
52 Reliability Scores
53 Questions?
54 References Robert C. Edgar – MUSCLE: a multiple sequence alignment method with reduced time and space complexity. Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou – ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment.
55 References (continued) Slides on Multiple Sequence Alignment, CS262 Slides on Sequence similarity, CS273 Slides on Protein Multiple Alignment, Marina Sirota