Marina Sirota CS374 October 19, 2004 P ROTEIN M ULTIPLE S EQUENCE A LIGNMENT
O UTLINE Introduction Alignments Pairwise vs. Multiple DNA vs. Protein MUSCLE P ROB C ONS Conclusion
I NTRODUCTION 1. Sequence Analysis – Look at DNA and protein sequences, searching for clues about structure, function and control 2. Structure Analysis – Examine biological structures, to learn more about structure, function and control 3. Functional Analysis – Understand how the sequences and structures lead to the biological function
P AIRWISE S EQUENCE A LIGNMENT ( R EVIEW) The Problem: Given two sequences of letters, and a scoring scheme for evaluating matching letters, find the optimal pairing of letters from one sequence to letters of the other sequence Basic Idea: The score of the best possible alignment that ends at a given pair of positions (i, j) is the score of the best alignment previous to those two positions plus the score for aligning those two positions.
P AIRWISE S EQUENCE A LIGNMENT ( R EVIEW)
P AIRWISE vs. M ULTIPLE PAIRWISE Evaluated by addition of match or mismatch scores for aligned pairs and affine gap penalties for unaligned pairs O(L 2 ) time and O(L) space via dynamic programming MULTIPLE Lack of proper objective scoring functions to measure alignment quality High computational cost and no efficient algorithm that can be applied L = sequence length
P ROTEIN VS. DNA DNA (4 characters) Protein (20 characters) DNA – 50% similarity Protein – 20% similarity DNA – fewer sequences to compare Protein – many sequences to compare DNA aligners need to be able to handle long sequences, protein aligners do not
P ROTEIN M ULTIPLE S EQUENCE A LIGNMENT Note that areas that are considered very similar don’t necessarily contain the same amino acids
M OTIVATION Find similarity between known and unknown sequences Protein sequence similarity implies divergence from a common ancestor and functional similarity P ROBLEM Given n sequences and a scoring scheme for evaluating matching letters, find the optimal pairing of letters between the sequences Can be done using dynamic programming with time and space complexity O(L n ) which is not practical!!! Need new algorithms and approaches
A PPLICATIONS Evolutionary research Isolation of most relevant regions Characterization of protein families
M ORE A PPLICATIONS 3Dimentional structure prediction Phylogenetic Studies
P APERS MUSCLE: a Multiple Sequence Alignment Method with Reduced Time and Space Complexity by Robert C. Edgar ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment by Chuong B. Do, Michael Brudno, and Serafim Batzoglou
M USCLE – O VERVIEW Basic Idea: A progressive alignment is built, to which horizontal refinement is applied 3 stages of the algorithm At the completion of each, a multiple alignment is available and the algorithm can be terminated Significant improvement in accuracy and speed
M USCLE – T HE A LGORITHM Stage 1: Draft Progressive – Builds a progressive alignment Similarity of each pair of sequences is computed using K-mer counting Constructing a global alignment and determining fractional identity of the sequences A tree is constructed and a root is identified A progressive alignment is built by following the branching order of the tree, yielding a multiple alignment
M USCLE – P ROGRESSIVE A LIGNMENT
M USCLE – P ROFILE-PROFILE A LIGNMENT
M USCLE – T HE A LGORITHM Stage 2: Improved Progressive – Improves the tree Similarity of each pair of sequences is computed using fractional identity from the mutual alignment A tree is constructed by applying a clustering method to the distance matrix The trees are compared; a set of nodes for which the branching order has changed is identified A new alignment is built, the existing one is retained if the order is unchanged
M USCLE – T REE C OMPARISON
M USCLE – T HE A LGORITHM Stage 3: Refinement – Iterative Refinement is performed An edge is deleted from a tree, dividing the sequences into two disjoint subsets The profile (MA) of each subset is extracted The profiles are re-aligned to each other The score is computed, if the score has increased, the alignment is retained, otherwise it is discarded Algorithm terminates at convergence
M USCLE – I TERATIVE R EFINEMENT S T U X Z Delete this edge Realign these resulting profiles to each other S T U X Z
M USCLE Results: O(N 2 + L 2 ) Space and O(N 4 + NL 2 ) Time Complexity Improvements in selection of heuristics Close attention paid to implementation details Enables high-throughput applications to achieve good accuracy
P ROB C ONS - O VERVIEW Alignment generation can be directly modeled as a first order Markov process involving state emissions and transitions Uses maximum expected accuracy alignment method Probabilistic consistency used as a scoring function Model parameters obtained using unsupervised maximum likelihood methods Incorporate multiple sequence information in scoring pairwise alignments
P ROB C ONS – H IDDEN M ARKOV M ODEL Deletion penalties on Match => Gap transitions Extension penalties on Gap => Gap transitions Match/Mismatch penalties on Match emissions
INSERT XINSERT Y MATCH ABRACA-DABRA AB-ACARDI--- x y xixixixi yjyjyjyj ― yjyjyjyj xixixixi― Basic HMM for sequence alignment between two sequences M emits two letters, one from each sequence I x emits a letter from x that aligns to a gap I y emits a letter from y that aligns to a gap P ROB C ONS – H IDDEN M ARKOV M ODEL
P ROB C ONS - M AXIMUM E XPECTED A CCURACY L AZY T EACHER A NALOGY 10 students take a 10 question true/false quiz How do you make up the answer key? 1. Use the answers of the single best student (Viterbi Algorithm) 2. Use weighted majority rule (Maximum Expected Accuracy)
P ROB C ONS – M AXIMUM E XPECTED A CCURACY Viterbi Picks a single alignment with the highest chance of being completely correct (analogous to Needleman- Wunch) Mathematically, finds the alignment a which maximizes Ea*[1{a = a*}] (maximum probability alignment) Maximum Expected Accuracy Picks alignment with the highest expected number of correct predictions Mathematically, finds the alignment a which maximizes Ea*[accuracy(a, a*)]
P ROB C ONS – C OMPUTING MEA Define accuracy (a, a*) = the expected number of correctly aligned pairs of letters divided by the length of the shorter sequence The MEA alignment is found by finding the highest summing path through the matrix Mxy[i, j] = P(x i is aligned to y j | x, y) We just need to compute these terms! Can use dynamic programming
z x y xixi yjyj y j’ zkzk P ROB C ONS – P ROBABILISTIC C ONSISTENCY
Compute P(x i is aligned to y j | x, y) P(x i is aligned to y j | x, y, z) We can re-estimate Mxy as (Mxz)(Mzy) where z is a third sequence to which x and y are aligned Mxy[i,j] = ∑ Mxz[i.k] Mzy(k,j), where n is the length of z We follow the alignment from position i of x, to position j of y, through all intermediate positions k of a third sequence z P ROB C ONS – P ROBABILISTIC C ONSISTENCY k = 1 n
A straightforward generalization –sum-of-pairs –tree-based progressive alignment –iterative refinement ABRACA-DABRA AB-ACARDI--- ABRA---DABI- AB-ACARDI--- ABRA---DABI- ABRACADABRA ABRA--DABI- ABRACA-DABRA AB-ACARDI--- ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABACARDIABRACADABRA ABRACA-DABRA AB-ACARDI--- ABRADABI ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABACARDI ABRACADABRA ABRA--DABI- ABRACA-DABRA AB-ACARD--I- ABRA---DABI- P ROB C ONS – M ULTIPLE A LIGNMENT
P ROB C ONS – T HE A LGORITHM Step 1: Computation of posterior-probability matrices For every pair of sequences x, y compute the probability that letters x i y j are paired in a*, an alignment of x and y that is randomly generated by the model Step 2: Computation of expected accuracies Define the expected accuracy of a pairwise alignment a xy to be the expected number of correctly aligned pairs of letters divided by the length of the shorter sequence Compute the alignment a xy that maximizes expected accuracy E(x,y) using dynamic programming
P ROB C ONS – T HE A LGORITHM Step 3: Probabilistic consistency transformation Re-estimate the scores with probabilistic consistency transformation by incorporating similarity of x and y to other sequences into the pairwise comparison of x and y Computed efficiently using sparse matrix multiplication ignoring all entries smaller than some threshold Step 4: Computation of a guide tree Construct a tree by hierarchical clustering using E(x, y). Cluster similarity is defined by a weighted average of pairwise similarities between the clusters
P ROB C ONS – T HE A LGORITHM Step 5: Progressive Alignment Align sequence groups hierarchically according to the order specified in the guide tree Score using a sum of pairs function in which the aligned residues are scored according to the match quality scores and the gap penalties are set to 0 Step 6: Iterative Refinement Randomly partition alignment into two groups of sequences and realign. May be repeated as necessary
P ROB C ONS Results: Best results so far Longer in running time due to the computation of posterior probability matrices (Step 1) Doesn’t incorporate biological information Could provide improved accuracy in DNA multiple alignment
P ROB C ONS
C ONCLUSION Protein multiple alignment is a current research problem (both papers published in 2004) Many applications including evolutionary and phylogenetic studies, protein structure and classification Currently, there is some collaboration between the authors of MUSCLE and P ROB C ONS to create a new program which will combine the speed of MUSCLE-based tree construction and the accuracy that comes from using MEA and probabilistic consistency
R EFERENCES Do, C.B., Brudno, M., and Batzoglou, S. PROBCONS: Probabilistic Consistency-based Multiple Alignment of Amino Acid Sequences. Submitted. Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5), Some of these slides were adapted from Tom Do’s ISMB presentation on PROBCONS.
T HANKS!