Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006.

Background we’ve seen before: Alignment of sequences allows us to examine homologous regions. Two proteins with regions of high sequence similarity are likely to perform the same function. Conserved regions point to structural similarity.

Multiple Sequence Alignment (images from STRAP)

Aligned regions represent spatial similarity (images from STRAP)

Background we’ve seen before: Alignment of sequences allows us to examine homologous regions. Two proteins with regions of high sequence similarity are likely to perform the same function. Conserved regions point to structural similarity. Evolutionary history can be inferred from similarity. Aligned residues should have evolved from the same ancestral residue.

Recap on alignments: Classic pairwise sequence alignment uses dynamic programming approaches. Use an affine gap penalty to model evolutionary events more accurately.

Sequence alignment revisited: Two sequences of length L require O(L^2) space and O(3L^2) time.** Constant factors are dropped, so the time complexity is O(3L^2) = O(L^2). **Different algorithms may alter the complexity.

Recap on alignments: Classic pairwise sequence alignment uses dynamic programming approaches. Use an affine gap penalty to model evolutionary events more accurately. Multiple sequence alignment approaches: why is the classic pairwise alignment not extendable to multiple sequences?

Sequence alignment revisited: Two sequences of length L require O(L^2) space and O(L^2) time. Three sequences of length L require O(L^3) space and O(L^3) time. (Image from Durbin et al.)

Sequence alignment revisited: Two sequences of length L require O(L^2) space and O(L^2) time. Three sequences of length L require O(L^3) space and O(L^3) time. Four sequences? N sequences? Generally, time is O(L^N). (Image from Durbin et al.)

Run-time for the calculations: Let’s assume we have N sequences of length L. Time complexity is O(L^N). Assume this computation takes 10^(2N-4) seconds: 2 sequences take 1 second and 3 sequences take 100 seconds. In our example there were N = 12 sequences: 10^(2*12-4) = 10^20 seconds, about 3 trillion years!
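The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch of the slide's hypothetical cost model of 10^(2N-4) seconds; the function names and the 365.25-day year are assumptions made for illustration. Under this model, N = 3 costs 10^2 = 100 seconds.

```python
# Hypothetical cost model from the slide: aligning N sequences takes 10^(2N-4) seconds.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed length of a year


def alignment_seconds(n_sequences: int) -> float:
    """Seconds needed for an exact alignment of n_sequences under the toy model."""
    return 10.0 ** (2 * n_sequences - 4)


def seconds_to_years(seconds: float) -> float:
    """Convert a duration in seconds to years."""
    return seconds / SECONDS_PER_YEAR


for n in (2, 3, 12):
    secs = alignment_seconds(n)
    print(f"N={n}: {secs:.3g} seconds = {seconds_to_years(secs):.3g} years")
```

For N = 12 the model gives 10^20 seconds, which is on the order of 3 * 10^12 years.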

Solutions: Heuristic approaches to sequence alignment. Progressive multiple alignment.

Perform pairwise alignments for all sequences. Assume a match gives a score of 1, a mismatch is -0.25, and an indel is -0.5. Per-column scores: 1, -0.25, 1, 1, 1, 1. Total score: 4.75.

Progressive multiple alignment: Perform pairwise alignments for all sequences. Assume a match gives a score of 1, a mismatch is -0.25, and an indel is -0.5. Total score: 0.5.
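The pairwise scores above can be reproduced with a small global alignment routine. The sketch below assumes a linear (not affine) gap penalty, because the slide quotes a single indel cost; the function name and the two example sequences are hypothetical, since the original sequences were shown only as an image.

```python
def global_alignment_score(x: str, y: str, match: float = 1.0,
                           mismatch: float = -0.25, indel: float = -0.5) -> float:
    """Best global alignment score of x and y (Needleman-Wunsch, linear gap cost)."""
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one indel per residue.
    for i in range(1, rows):
        score[i][0] = i * indel
    for j in range(1, cols):
        score[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align x[i-1] with y[j-1]
                score[i - 1][j] + indel,             # gap in y
                score[i][j - 1] + indel,             # gap in x
            )
    return score[-1][-1]


# Five matches and one mismatch, mirroring the 4.75 example.
print(global_alignment_score("ACGTAC", "ACCTAC"))
```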

Progressive multiple alignment: Create a guide tree from the pairwise alignments. Use the tree to build the multiple sequence alignment. Align the most similar sequences first (they give the most reliable alignments). Align the profile to the next closest sequence.

Progressive multiple alignment: Create a guide tree from the pairwise alignments. Use the tree to build the multiple sequence alignment. Align the most similar sequences first (they give the most reliable alignments). Align the profile to the next closest sequence. Align profiles to each other. The multiple sequence alignment will be at the root of the tree.
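As a rough illustration of the guide-tree step, the sketch below repeatedly merges the two most similar clusters. It assumes average-linkage clustering on pairwise alignment scores, which the slides do not specify; the function name, the sequence names, and the scores are all hypothetical.

```python
from itertools import combinations


def guide_tree(names, similarity):
    """Build a nested-tuple guide tree by merging the most similar clusters first.

    similarity maps frozenset({a, b}) to the pairwise alignment score of a and b.
    """
    # Each cluster is (subtree, list of leaf names under that subtree).
    clusters = [(name, [name]) for name in names]

    def average_similarity(a, b):
        pairs = [(p, q) for p in a[1] for q in b[1]]
        return sum(similarity[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


scores = {
    frozenset(("A", "B")): 4.75, frozenset(("C", "D")): 4.0,
    frozenset(("A", "C")): 0.5, frozenset(("A", "D")): 0.5,
    frozenset(("B", "C")): 1.0, frozenset(("B", "D")): 0.75,
}
print(guide_tree(["A", "B", "C", "D"], scores))
```

Reading the nested tuple from the innermost pairs outward gives the order in which sequences and profiles would be aligned.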

Progressive multiple alignment

ProbCons: Attempts to identify a* for all sequence pairs, where a* is the unknown alignment that best represents the true biological alignment.
1. Create posterior probability matrix.
2. Compute expected accuracies: determine the number of correctly aligned pairs.
3. Consistency-based transformation: think of the transitive property of equality, using a 3rd sequence to recompute the match quality score.
4. Compute guide tree: hierarchical clustering by expected accuracies.
5. Progressive alignment using guide tree.

Posterior Probabilities: Use a pair-HMM for sequence alignment to compute the probability that letters x_i and y_j are paired in the true alignment. (Image from Do et al.)

Posterior Probabilities: The pair-HMM can be represented as 3 matrices: match state (can only move diagonally), insertion in x (only i can increase), and insertion in y (only j can increase). Each step along a path contributes a transition probability and an emission probability.

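Posterior Probabilities: The probability of any unique alignment a, with state path s_1 ... s_n and emitted observations o_1 ... o_n, can be computed as follows: P(a) = π(s_1) · β(o_1|s_1) · Π_{i=1..n-1} α(s_i → s_{i+1}) · β(o_{i+1}|s_{i+1}), where π(s) = probability of starting in state s, α(s_i → s_{i+1}) = transition probability, and β(o_i|s_i) = emission probability of o_i in state s_i.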
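A small sketch of that product: the start probability of the first state, then one transition and one emission per step. The three-state model (M, X, Y) follows the slides, but every probability value below is invented for illustration and the function name is hypothetical.

```python
def path_probability(states, observations, start, transition, emission):
    """Probability of one alignment: start * emissions * transitions along the path."""
    prob = start[states[0]] * emission[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        prob *= transition[prev][cur] * emission[cur][obs]
    return prob


# Toy parameters (assumed, not taken from ProbCons).
start = {"M": 0.8, "X": 0.1, "Y": 0.1}
transition = {
    "M": {"M": 0.8, "X": 0.1, "Y": 0.1},
    "X": {"M": 0.6, "X": 0.4, "Y": 0.0},
    "Y": {"M": 0.6, "X": 0.0, "Y": 0.4},
}
emission = {
    "M": {("A", "A"): 0.2, ("C", "C"): 0.2},  # a pair of aligned residues
    "X": {("G", "-"): 0.25},                   # residue in x against a gap
    "Y": {("-", "G"): 0.25},                   # residue in y against a gap
}

# Alignment  A G C  /  A - C  corresponds to the state path M, X, M.
path = ["M", "X", "M"]
columns = [("A", "A"), ("G", "-"), ("C", "C")]
print(path_probability(path, columns, start, transition, emission))
```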

Posterior Probabilities: Compute the posterior probability that x_i and y_j are matched in a* (the “true” biological alignment). Many paths pass through x_i and y_j, and their probabilities sum to 0.35. Path a_2 (the most probable path) has a probability of 0.08. Taken together, the paths that align x_i and y_j make the alignment of these two residues very likely. Some other path, such as a_2, may be the single most probable path, yet no single pair on that path scores as high as x_i and y_j.

Posterior Probabilities: Compute the posterior probability that x_i and y_j are matched in a* (the “true” biological alignment): P(x_i ~ y_j ∈ a* | x, y) = Σ_a P(a | x, y) · 1{x_i ~ y_j ∈ a}. The indicator 1{x_i ~ y_j ∈ a} evaluates to 1 when x_i and y_j are aligned in a, and 0 otherwise. Therefore, the probability that two residues are aligned is increased by having them appear in alignments presumed to be more probable.
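The sum over alignments can be illustrated by brute force on a toy distribution. In practice the posterior is computed with the pair-HMM rather than by enumeration; the four alignments and their probabilities below are invented, and each alignment is represented simply as the set of (i, j) index pairs it matches.

```python
def pair_posterior(alignments, pair):
    """Sum the probabilities of all alignments that contain the matched pair."""
    return sum(prob for matched, prob in alignments if pair in matched)


# Toy distribution over four candidate alignments (probabilities sum to 1).
alignments = [
    ({(0, 0), (1, 1)}, 0.4),  # the single most probable alignment
    ({(0, 0), (1, 2)}, 0.3),
    ({(0, 1), (1, 2)}, 0.2),
    ({(1, 0)}, 0.1),
]

# Pair (1, 2) is absent from the most probable alignment, yet its posterior
# (0.3 + 0.2) exceeds that alignment's probability of 0.4.
print(pair_posterior(alignments, (1, 2)))
print(pair_posterior(alignments, (0, 0)))
```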

Maximal expected accuracy alignment: Use the simple Needleman-Wunsch algorithm. Use the posterior probabilities as the match and mismatch scores. Gap penalties are set to 0. Imagine increasingly green squares represent aligned residues whose probability approaches 1, and increasingly red squares those whose probability approaches 0. We want to maximize the overall probability of the path by taking the “greenest” path.

Maximal expected accuracy alignment: Use the simple Needleman-Wunsch algorithm. Use the posterior probabilities as the match and mismatch scores. Gap penalties are set to 0. The alignment is computed to maximize the most probable matches. Finding the exactly correct alignment is difficult and not crucial. This maximizes the number of correctly aligned residues.
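A minimal sketch of this step: Needleman-Wunsch over a posterior matrix, with the posterior as the match score and zero gap cost. The function name and the 3x3 posterior matrix are hypothetical.

```python
def mea_alignment(posterior):
    """Return (expected accuracy, matched index pairs) for a posterior matrix.

    posterior[i][j] is the probability that x_i and y_j are aligned.
    """
    n = len(posterior)
    m = len(posterior[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + posterior[i - 1][j - 1],  # match x_i with y_j
                best[i - 1][j],                                # gap, no penalty
                best[i][j - 1],                                # gap, no penalty
            )
    # Trace back to recover which residue pairs were matched.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + posterior[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


toy_posterior = [
    [0.9, 0.05, 0.0],
    [0.1, 0.2, 0.6],
    [0.0, 0.1, 0.3],
]
print(mea_alignment(toy_posterior))
```

In this toy matrix the best path skips the main diagonal: matching (0, 0) and (1, 2) accumulates more posterior mass than matching all three diagonal cells.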

Consistency-based scheme (sequences x, z, and y; residues z_k, x_i, y_j): Take location k in sequence z. z_k aligns with location i in sequence x. z_k aligns with location j in sequence y.

Consistency-based scheme (sequences x, z, and y; residues z_k, x_i, y_j): In the ProbCons consistency-based scheme, the alignment of x_i to y_j will receive a high score.

Probabilistic consistency transformation: Given a set of sequences S, we can compute P'_xy = (1/|S|) Σ_{z∈S} P_xz P_zy. Remove all values in P_xz and P_zy that are below a certain threshold. We then obtain the probability of residues x_i and y_j being aligned given the set of all sequences. P(x_i ~ x_j | x) is 1 if i = j, 0 otherwise.
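The transformation amounts to averaging matrix products over every intermediate sequence z, with the identity matrix standing in when z is x or y. The sketch below assumes plain nested lists and omits the thresholding step; the function names and the toy posterior values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n):
    """P(x_i ~ x_j | x): 1 on the diagonal, 0 elsewhere."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def consistency_transform(posteriors, lengths, x, y):
    """Average P_xz * P_zy over all sequences z (including x and y themselves).

    posteriors[(a, b)] holds the posterior matrix for sequences a and b;
    lengths maps each sequence name to its length.
    """
    total = [[0.0] * lengths[y] for _ in range(lengths[x])]
    for z in lengths:
        p_xz = identity(lengths[x]) if z == x else posteriors[(x, z)]
        p_zy = identity(lengths[y]) if z == y else posteriors[(z, y)]
        product = matmul(p_xz, p_zy)
        for i in range(lengths[x]):
            for j in range(lengths[y]):
                total[i][j] += product[i][j]
    return [[value / len(lengths) for value in row] for row in total]


lengths = {"x": 2, "y": 2, "z": 2}
posteriors = {
    ("x", "y"): [[0.5, 0.1], [0.1, 0.5]],
    ("x", "z"): [[0.9, 0.0], [0.0, 0.9]],
    ("z", "y"): [[0.9, 0.0], [0.0, 0.9]],
}
print(consistency_transform(posteriors, lengths, "x", "y"))
```

Here the strong x-z and z-y matches raise the diagonal x-y entries from 0.5 to about 0.60, while the off-diagonal entries fall.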

ProbCons: Attempts to identify a* for all sequence pairs, where a* is the unknown alignment that best represents the true biological alignment.
1. Create posterior probability matrix.
2. Compute expected accuracies: determine the number of correctly aligned pairs.
3. Consistency-based transformation: think of the transitive property of equality, using a 3rd sequence to recompute the match quality score.
4. Compute guide tree: hierarchical clustering by expected accuracies.
5. Progressive alignment using guide tree: sum-of-pairs mode following the guide tree.

Results: BAliBASE benchmark data. 141 reference protein alignments (hand-constructed alignments, structural alignments). 5 reference sets with varying degrees of similarity. Accuracy was scored for 5 aligners and ProbCons. Sum-of-pairs (SP): number of correctly aligned residue pairs divided by the total number in the reference set. Column score (CS): number of correctly aligned columns divided by the total number of aligned columns in the reference set. An additional variant (ProbCons-ext) extends the HMM to model long terminal insertions in x or y.
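Both scores can be sketched for alignments given as lists of equal-length gapped strings. This is a simplification under stated assumptions: it scores every reference column, whereas BAliBASE restricts scoring to core blocks, and the function names and example alignments are hypothetical.

```python
from itertools import combinations


def column_signatures(alignment):
    """For each column, the residue index per sequence (None where there is a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for c in range(len(alignment[0])):
        column = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                column.append(None)
            else:
                column.append(counters[s])
                counters[s] += 1
        columns.append(tuple(column))
    return columns


def aligned_pairs(alignment):
    """Set of (seq_a, index_a, seq_b, index_b) residue pairs sharing a column."""
    pairs = set()
    for column in column_signatures(alignment):
        for (s, i), (t, j) in combinations(enumerate(column), 2):
            if i is not None and j is not None:
                pairs.add((s, i, t, j))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


def column_score(test, reference):
    """Fraction of reference columns reproduced exactly by the test alignment."""
    test_columns = set(column_signatures(test))
    ref_columns = column_signatures(reference)
    return sum(col in test_columns for col in ref_columns) / len(ref_columns)


reference = ["ACGT", "AC-T"]
predicted = ["ACGT", "A-CT"]
print(sp_score(predicted, reference), column_score(predicted, reference))
```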

Results: ProbCons shows high column reliability in actual homology regions. Only core blocks in the BAliBASE alignment are considered “actual”. ProbCons may be detecting true homology outside core regions. Column reliability is computed as the proportion of correct pairwise matches.

Results, comparative analysis: ProbCons outscores all other aligners in every benchmark dataset. Runtime is moderate compared to others. When ProbCons-ext is the top-scoring method, ProbCons is second in most cases. Exception: reference set 4, “sequences with large N/C-terminal extensions”.

Problems with insertions: Insertions are penalized multiple times even though the event occurs only once!

Problems with insertions: Initial set of sequences. Sequences are organized into a tree.

Problems with insertions: Pairwise sequence alignments are performed between the evolutionarily closest sequences.

Problems with insertions: The top sequence contains an insertion, and a gap penalty is incurred for the alignment. With two sequences it is not clear whether the gap represents an insertion or a deletion! One sequence may have undergone a deletion event (two Ts were removed), or the other may have undergone an insertion event (two Ts were inserted).

Problems with insertions: In the most parsimonious explanation, this gap represents a deletion in the middle sequence (it is not present in any other sequence). It will be scored as 1 pairwise match and two gaps.

Problems with insertions: If we examine the sequences in a tree, we can see the deletion occurred once, and has been scored accordingly.

Problems with insertions: The deletion occurs in this branch.

Problems with insertions: Let’s examine an insertion event. The insertion occurs in this branch.

Problems with insertions: In the most parsimonious explanation, this gap represents an insertion in the top sequence. It will be scored as two gaps. Aligned gaps receive no match score!

Problems with insertions: The scoring issues grow with an increased number of sequences!

Problems with insertions: Although this alignment represents only one insertion event, it is penalized for n-1 gaps (where n is the number of sequences) and receives no match score.

Problems with insertions: Although both alignments represent the same number of events, they will be scored differently. Single deletion event: n-1 gap penalties and C(4,2) match scores. Single insertion event: n-1 gap penalties.
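The asymmetry can be seen by scoring a single alignment column with sum-of-pairs. The sketch below assumes n = 5 sequences and reuses the earlier example scores (match 1, mismatch -0.25, gap -0.5); the function name and the two columns are hypothetical.

```python
from itertools import combinations


def column_sp_score(column, match=1.0, mismatch=-0.25, gap=-0.5):
    """Sum-of-pairs score of one alignment column; gap-gap pairs score nothing."""
    total = 0.0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue            # aligned gaps receive no score
        if a == "-" or b == "-":
            total += gap        # residue against gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


deletion_column = "TTTT-"   # one sequence lost the residue
insertion_column = "T----"  # one sequence gained the residue

# Deletion: 4 gap penalties plus C(4,2) = 6 matches. Insertion: 4 gap penalties only.
print(column_sp_score(deletion_column), column_sp_score(insertion_column))
```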

Other programs: Other methods treat all gaps as deletions and use heuristics to correct for repeated gaps. Example: CLUSTALW lowers gap penalties as multiple gaps build up in one region. This infers long ancestral sequences: every insertion is modeled as original sequence! The first gap is penalized as usual; subsequent gaps exhibit a lower gap penalty.

Other programs: Other methods treat all gaps as deletions and use heuristics to correct for repeated gaps. Example: CLUSTALW lowers gap penalties as multiple gaps build up in one region. This infers long ancestral sequences: every insertion is modeled as original sequence!

Löytynoja et al.: Propose skipping inserted subsequences once they have been aligned. Implements affine gap pairwise alignments with an HMM whose states are similar to those in ProbCons. Introduces matrices which store previous insertions. Uses an evolutionary scoring function: all state transitions are described by indel rates and evolutionary distances, and character emission is described by an “evolutionary substitution” model, so various models can be incorporated.

Progressive alignment: Algorithm to allow pre-existing gaps to be skipped. The insertion has already been penalized in the top profile. Keep track of pointers for all previous insertions. Allow a “free ride” over a previous insertion during an alignment. AC in both sequences is aligned as usual. A “free ride” is given over the insertion in the child branch, with no penalty. The remaining alignment is scored regularly.

Addition of matrices: Matrices are added for previous insertions, as pointers to insertions in the current match states. As with all recurrences, the best-scoring path is taken. No additional time complexity.

Addition of matrices: Matrices are added for previous insertions, as pointers to insertions in the current match states. As with all recurrences, the best-scoring path is taken. No additional time complexity. Figure labels: score at beginning of previous insertion; score as a regular insertion (penalty incurred); score as a previous insertion (no additional penalty incurred).

Probabilistic alignment: The goal is to determine the ancestral sequence. Sequences consist of vectors of probabilities for all residues at each position. p_a(x_i) = the probability that the i-th position of sequence x has residue a (simply a profile). For the input sequences, the vectors have probability 1 assigned to the observed character at each position, and 0 to all others:
A  1 0 0 0 0 0
C  0 0 0 0 0 0
G  0 1 0 0 1 0
T  0 0 1 1 0 1
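Building such a profile from an observed sequence is a one-hot encoding. In the sketch below, the sequence AGTTGT is read off the table on this slide (1 in row A at position 1, row G at positions 2 and 5, row T at positions 3, 4 and 6); the function name and the dictionary layout are assumptions.

```python
ALPHABET = "ACGT"


def sequence_to_profile(sequence, alphabet=ALPHABET):
    """Map each residue to its row of per-position probabilities (1.0 or 0.0)."""
    return {residue: [1.0 if observed == residue else 0.0 for observed in sequence]
            for residue in alphabet}


profile = sequence_to_profile("AGTTGT")
for residue in ALPHABET:
    print(residue, profile[residue])
```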

Probabilistic alignment: The goal is to determine the ancestral sequence. Sequences consist of vectors of probabilities for all residues at each position. p_a(x_i) = the probability that the i-th position of sequence x has residue a (simply a profile). For the input sequences, the vectors have probability 1 assigned to the observed character at each position, and 0 to all others. Profile shown on this slide (some values did not survive in the transcript):
A  .5 0 0 0 0 0
C  (first two values missing) 0 0 0 0
G  0 (value missing) 0 0 1 1
T  0 0 1 1 0 0

Probabilistic scoring function: The authors describe the conditional probability that residue a is in position k of z, the ancestral sequence. It is built from the substitution probability between a and b given the evolutionary distance between x and its ancestor, and from the probability that the i-th position of sequence x has residue b. The authors define a normalized evolutionary score for matching residues at locations x_i and y_j, which uses the equilibrium frequency of character a.
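One plausible reading of these ingredients is sketched below. It is not the authors' exact formula: the Jukes-Cantor substitution model, the uniform equilibrium frequencies, the function names, and the omission of the normalization step are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def jc_substitution(a, b, t):
    """Jukes-Cantor probability of seeing b where the ancestor had a, at distance t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def ancestral_vector(p_x, p_y, t_x, t_y):
    """Unnormalized weight of each ancestral residue a, given both child positions.

    p_x and p_y map residue -> probability at the matched positions x_i and y_j.
    """
    return {
        a: sum(jc_substitution(a, b, t_x) * p_x[b] for b in ALPHABET)
           * sum(jc_substitution(a, b, t_y) * p_y[b] for b in ALPHABET)
        for a in ALPHABET
    }


def match_likelihood(p_x, p_y, t_x, t_y, equilibrium):
    """Weight each ancestral residue by its equilibrium frequency and sum."""
    ancestor = ancestral_vector(p_x, p_y, t_x, t_y)
    return sum(equilibrium[a] * ancestor[a] for a in ALPHABET)


uniform = {a: 0.25 for a in ALPHABET}
only_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
only_g = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}
print(match_likelihood(only_a, only_a, 0.1, 0.1, uniform))
print(match_likelihood(only_a, only_g, 0.1, 0.1, uniform))
```

Under these assumptions, matching identical residues over short branches yields a larger value than matching different residues, which is the behavior a match score needs.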

Results: Study of 20 primate mitochondrial D-loop sequences. The authors’ method produced phylogenetically consistent gaps. Regions deemed indel “hot spots” by CLUSTALW are likely artifacts of the method. (Figure panels: CLUSTALW; authors’ method.)

Results Gaps are consistent with the phylogenetic tree

CLUSTALW-artifacts? Gaps are largely inconsistent with the phylogenetic tree

References
Do, C.B., Mahabhashyam, M.S.P., Brudno, M., and Batzoglou, S. 2005. PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment. Genome Research 15: 330-340.
Löytynoja, A., and Goldman, N. 2005. An algorithm for progressive multiple alignment of sequences with insertions. PNAS 102(30): 10557-10562.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis. Cambridge University Press, Cambridge, UK.
http://bioalgorithms.info
Gille, C., and Frommel, C. STRAP: editor for STRuctural Alignments of Proteins.

Sequence homology in a tree: Alignments can be represented in a tree. Evolutionary information is contained in the branch lengths and the tree organization. Sequence insertions pose special problems when carried up the tree. The trees used to guide multiple sequence alignments may be inferred by the program or provided as input to it.
