Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path
Ilya N. Shindyalov, Philip E. Bourne
Why Align Structures?
Provides an additional measure of protein similarity.
Structure is generally preserved better than sequence over the course of evolution.
May help in protein fold identification.
Poses an interesting combinatorial problem.
The Structural Alignment Problem
We know how to optimally superimpose two proteins of the same length so as to minimize RMSD (Hendrickson, 1979).
However, there is no obvious way to compare structures of different lengths, or to optimally add or remove gaps.
Heuristic methods for structural alignment are the best we can do at the moment.
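For reference, the RMSD between two equal-length sets of alpha-carbon coordinates can be sketched in Python as below. The function name `rmsd` and the tuple-based coordinate format are assumptions made for illustration, and the sketch assumes the two structures have already been superimposed; the optimal rotation and translation step is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length coordinate lists.

    Each coordinate is an (x, y, z) tuple. The structures are assumed to be
    superimposed already; no rotation or translation is fitted here.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("RMSD is only defined for equal-length structures")
    if not coords_a:
        raise ValueError("at least one atom is required")
    # Sum of squared distances between corresponding atoms.
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

The length check makes the limitation on this slide concrete: without a residue-to-residue correspondence, the measure cannot be evaluated at all.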
Alignment Fragment Pairs
For a pair of proteins A and B, an alignment fragment pair (AFP) is a continuous segment of A aligned against a continuous segment of B of the same size (without gaps).
If $n_1$ and $n_2$ are the lengths of A and B, and the AFP length is set to $m$, then there are approximately $(n_1 - m)(n_2 - m)$ AFPs in total (exactly $(n_1 - m + 1)(n_2 - m + 1)$ if every start position is counted).
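To make the size of the search space concrete, here is a small Python sketch that enumerates every AFP as a pair of start positions. The function name `enumerate_afps` and the 0-based indexing are assumptions for illustration. Counting every valid start position gives $(n_1 - m + 1)(n_2 - m + 1)$ pairs.

```python
def enumerate_afps(n1, n2, m):
    """Return all (start_a, start_b) pairs of gap-free fragments of length m.

    n1 and n2 are the lengths of proteins A and B; positions are 0-based.
    """
    if m <= 0:
        raise ValueError("fragment length m must be positive")
    return [
        (start_a, start_b)
        for start_a in range(n1 - m + 1)   # last valid start in A is n1 - m
        for start_b in range(n2 - m + 1)   # last valid start in B is n2 - m
    ]
```

The count grows with the product of the two protein lengths, which is why the heuristics on the later slides prune candidates rather than scoring every combination of AFPs.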
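Defining an Alignment
An alignment is defined as a continuous path of AFPs of fixed length $m$ such that between any two consecutive AFPs, gaps may be inserted into either A or B, but not into both. That is, for every two consecutive AFPs $i$ and $i+1$, one of the following holds:
1) $p_{i+1}^A = p_i^A + m$ and $p_{i+1}^B = p_i^B + m$ (no gap), or
2) $p_{i+1}^A > p_i^A + m$ and $p_{i+1}^B = p_i^B + m$ (gap in A), or
3) $p_{i+1}^A = p_i^A + m$ and $p_{i+1}^B > p_i^B + m$ (gap in B),
where $p_i^A$ is the starting position of AFP $i$ in protein A, and $p_i^B$ is its starting position in protein B.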
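The path definition can be checked mechanically. The Python sketch below assumes the three cases are: no gap, a gap in A only, or a gap in B only, with each AFP represented as a `(start_a, start_b)` tuple. The function names are hypothetical.

```python
def is_valid_step(prev_afp, next_afp, m):
    """Check that next_afp may follow prev_afp in an alignment path."""
    (pa, pb), (qa, qb) = prev_afp, next_afp
    no_gap = qa == pa + m and qb == pb + m      # both fragments are contiguous
    gap_in_a = qa > pa + m and qb == pb + m     # residues skipped in A only
    gap_in_b = qa == pa + m and qb > pb + m     # residues skipped in B only
    return no_gap or gap_in_a or gap_in_b


def is_valid_path(path, m):
    """Check every consecutive pair of AFPs in the path."""
    return all(is_valid_step(a, b, m) for a, b in zip(path, path[1:]))
```

For example, with $m = 8$ the path `[(0, 0), (8, 8), (20, 16)]` is valid (a gap of four residues in A before the third AFP), while `[(0, 0), (10, 10)]` is not, because it skips residues in both proteins at once.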
The CE Algorithm
Goal: Find a “good” local alignment for the structures of proteins A and B.
Basic idea:
1. Select some initial AFP.
2. Build an alignment path by incrementally adding AFPs in a way that satisfies the conditions on the previous slide.
3. Repeat step (2) until the length of each protein is traversed, or until no “good” AFPs remain.
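The three steps above can be sketched as a greedy loop in Python. This is a simplified illustration, not the authors' implementation: the `accept` callback stands in for whatever "goodness" criteria are chosen, `max_gap` is a hypothetical limit on gap size, and the sketch takes the first acceptable candidate rather than searching for the best one.

```python
def extend_path(start, n1, n2, m, accept, max_gap=30):
    """Greedily grow an alignment path from a starting AFP.

    start      -- (start_a, start_b) of the initial AFP
    n1, n2     -- lengths of proteins A and B
    m          -- fixed AFP length
    accept     -- hypothetical callback accept(path, candidate) -> bool
    max_gap    -- assumed upper bound on the gap between consecutive AFPs
    """
    path = [start]
    while True:
        pa, pb = path[-1]
        next_a, next_b = pa + m, pb + m
        # Candidates: no gap first, then gaps in A only or in B only.
        candidates = [(next_a, next_b)]
        for gap in range(1, max_gap + 1):
            candidates.append((next_a + gap, next_b))
            candidates.append((next_a, next_b + gap))
        # Keep only fragments that fit inside both proteins.
        candidates = [(a, b) for a, b in candidates
                      if a + m <= n1 and b + m <= n2]
        chosen = next((c for c in candidates if accept(path, c)), None)
        if chosen is None:
            return path  # a protein is exhausted or no "good" AFP remains
        path.append(chosen)
```

Passing the acceptance test in as a function keeps the loop independent of any particular distance heuristic, which mirrors how the slides separate the basic idea from the algorithm specifics.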
Algorithm Specifics
How do we choose the starting AFP?
What are the criteria for adding AFPs to our alignment path?
How do we know when to stop? That is, at what point do we know that there are no “good” AFPs left?
There are various heuristics that could be used to answer the above questions.
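Sample Heuristics: AFP Distances
We can define the distance between two different AFPs $i$ and $j$ as, for example, the average over all $m^2$ residue pairs:
$$D_{ij} = \frac{1}{m^2} \sum_{k=0}^{m-1} \sum_{l=0}^{m-1} \left| d^A(p_i^A + k,\; p_j^A + l) - d^B(p_i^B + k,\; p_j^B + l) \right|$$
Here, $d^A(p,q)$ represents the distance between the alpha carbon atoms at positions $p$ and $q$ in protein A, and $d^B(p,q)$ is defined likewise for protein B.
Setting $i=j$, and using the same formula, we can define the distance $D_{ii}$ between the two fragments of the same AFP.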
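One way to compute such a distance is sketched below in Python. It assumes the "all pairs" form: for every residue offset pair $(k, l)$, compare the alpha-carbon distance in A with the corresponding distance in B and average the absolute differences over the $m^2$ pairs. The function name and the coordinate format (lists of `(x, y, z)` tuples) are assumptions for illustration.

```python
import math


def afp_distance(coords_a, coords_b, afp_i, afp_j, m):
    """Average absolute difference of intra-protein distances for two AFPs.

    coords_a, coords_b -- alpha-carbon coordinates of proteins A and B
    afp_i, afp_j       -- (start_a, start_b) tuples; pass the same AFP twice
                          to get the self-distance D_ii
    m                  -- fixed AFP length
    """
    (pa_i, pb_i), (pa_j, pb_j) = afp_i, afp_j
    total = 0.0
    for k in range(m):
        for l in range(m):
            d_a = math.dist(coords_a[pa_i + k], coords_a[pa_j + l])
            d_b = math.dist(coords_b[pb_i + k], coords_b[pb_j + l])
            total += abs(d_a - d_b)
    return total / (m * m)
```

Because only distances within each protein are compared, the result does not depend on how the two structures are oriented relative to each other, so no superposition is needed at this stage.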
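Sample Heuristics: Extending the Alignment Path
Suppose our alignment path already consists of AFPs $0 \ldots n-1$, and we are trying to decide whether to add AFP $n$ to the path. We will do so only if:
(4) $D_{nn} < D_0$
(5) $\frac{1}{n} \sum_{i=0}^{n-1} D_{in} < D_1$
(6) $\frac{1}{n^2} \sum_{i=0}^{n} \sum_{j=0}^{n} D_{ij} < D_1$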
Extending Alignment Path (Cont)
Where $D_0$ and $D_1$ are specified cut-off distances.
The decision whether AFP $n$ is “fit” is based on condition (4).
The decision whether AFP $n$ “works” with all the other AFPs in the path is based on condition (5).
The decision whether we should extend the alignment path at all is based on condition (6).
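The three decisions can be sketched as a single Python predicate over a precomputed matrix of AFP distances. The exact forms are assumptions here: condition (4) is taken as the candidate's self-distance being below $D_0$, condition (5) as the mean distance from the candidate to each AFP already on the path being below $D_1$, and condition (6) as the summed pairwise distance over the path plus candidate, normalised by $n^2$, being below $D_1$. The function name and matrix layout are hypothetical.

```python
def should_extend(dist, n, d0, d1):
    """Decide whether candidate AFP n may be added to a path of AFPs 0..n-1.

    dist -- (n+1) x (n+1) nested list; dist[i][j] is the AFP distance D_ij,
            with index n referring to the candidate
    d0   -- cut-off for the candidate's self-distance
    d1   -- cut-off for the path-level averages
    """
    # Condition (4): is the candidate itself a "fit" fragment pair?
    fit = dist[n][n] < d0
    if n == 0:
        return fit  # nothing on the path yet to compare against
    # Condition (5): does the candidate "work" with every AFP on the path?
    works = sum(dist[i][n] for i in range(n)) / n < d1
    # Condition (6): is the whole path, candidate included, still acceptable?
    # The n**2 normalisation is an assumption of this sketch.
    total = sum(dist[i][j] for i in range(n + 1) for j in range(n + 1))
    whole = total / n ** 2 < d1
    return fit and works and whole
```

Checking (4) first is a cheap filter: a candidate whose own two fragments disagree can be rejected before any path-level sums are computed.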
Alignment Assessment and Post-alignment Optimization
To assess how good the alignment produced by CE is, we can compare it to alignments of random pairs of structures, and compute a Z-score based on the RMSD and the number of gaps in the final alignment.
Since CE does not penalize gaps, we can perform additional optimization after CE completes, using dynamic programming to remove excess gaps.
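As a rough illustration of the Z-score idea, the Python sketch below scores one observed value (for example, an alignment's RMSD) against a background sample from random structure pairs. The function name is hypothetical, the background values are assumed to be supplied by the caller, and how RMSD and gap count are combined into a single score is not specified on the slide and is not modelled here.

```python
import statistics


def z_score(observed, background):
    """Number of standard deviations between observed and the background mean.

    background -- values of the same measure for random pairs of structures
    """
    mean = statistics.fmean(background)
    spread = statistics.pstdev(background)  # population standard deviation
    if spread == 0:
        raise ValueError("background sample has no variation")
    return (observed - mean) / spread
```

For a measure such as RMSD, where smaller is better, a strongly negative value indicates an alignment much tighter than expected by chance.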
Results and Conclusion
The CE method is highly configurable, which is at once its strength and its weakness.
Adjusting multiple parameters, such as the AFP length $m$, the cut-off distances $D_0$ and $D_1$, and the definition of AFP distance, can result in varying alignments and execution speeds.
Results and Conclusion (Cont)
In general, CE does not consistently outperform existing structural alignment methods such as Dali and VAST: it does better for some pairs of structures, and worse for others.
Since it is fairly straightforward and easy to implement, CE provides an interesting addition to the toolbox of structural alignment algorithms.