Structure superposition ≠ Structure alignment
Lecture 11
Chapter 16, Gu and Bourne, “Structural Bioinformatics”
Why?
A. Study the conformational changes of the same protein with or without ligands
-- Same protein sequences
B. Study the effect of mutations on protein structure
-- Highly similar protein sequences
C. Assess protein structure predictions
-- How accurate are the predicted models?
-- Same protein sequences
D. Remote homolog detection. Structures are generally preserved better than sequences over the course of evolution, e.g. myoglobin and β-hemoglobin are homologous and have similar structures, but their sequence identity can be as low as 8.5%!
E. Classification of protein folds
Why? Structure conservation > sequence conservation
Structures may align well even if their sequence similarity is low. For example, myoglobin and beta-hemoglobin are structural neighbors that superpose well, yet their sequence identity is only 8.5%!
Why? Structure conservation > sequence conservation (Chothia and Lesk)
[Figure: Receiver Operating Characteristic (ROC) curve; x-axis: false positive rate (%)]
ASIDE: Making sense of a ROC curve
ROC experiment:
- For each pair P of proteins in the dataset, perform an alignment and record its score, S(P).
- Rank all pairs by score, from highest to lowest.
- Scan down the ranked pairs and record the rate of true positives and the rate of false positives at each step.
[Figure: Receiver Operating Characteristic curve; x-axis: false positive rate (%)]
ASIDE: Making sense of a ROC curve
Prediction  Benchmark
1.00        Yes
0.99        Yes
0.98        Yes
0.97        Yes
0.96        No
0.95        No
0.93        Yes
0.91        Yes
0.89        No
0.87        No
0.85        No
0.83        No
0.83        Yes
0.81        No
0.77        No
0.74        No
0.73        No
0.70        No
0.69        No
0.67        Yes
0.62        No
0.56        No
0.54        No
0.53        No
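The scan over the ranked list can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to make the slide's figure: it assumes that "Yes" in the benchmark column means the pair is a true positive, and the names `RANKED`, `roc_points` and `area_under_curve` are made up for this example.

```python
# (prediction score, benchmark says "Yes") pairs, copied from the list above.
RANKED = [
    (1.00, True), (0.99, True), (0.98, True), (0.97, True),
    (0.96, False), (0.95, False), (0.93, True), (0.91, True),
    (0.89, False), (0.87, False), (0.85, False), (0.83, False),
    (0.83, True), (0.81, False), (0.77, False), (0.74, False),
    (0.73, False), (0.70, False), (0.69, False), (0.67, True),
    (0.62, False), (0.56, False), (0.54, False), (0.53, False),
]


def roc_points(pairs):
    """Return the (false positive rate, true positive rate) points of the ROC curve."""
    # Rank from highest to lowest score; tied scores keep their listed order.
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, is_pos in ranked if is_pos)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in ranked:
        if is_pos:
            tp += 1  # curve moves up
        else:
            fp += 1  # curve moves right
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve (1.0 = perfect ranking, 0.5 = random)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    pts = roc_points(RANKED)
    for fpr, tpr in pts:
        print(f"FPR = {100 * fpr:5.1f}%   TPR = {100 * tpr:5.1f}%")
    print("AUC =", area_under_curve(pts))
```

Each "Yes" moves the curve one step up and each "No" one step to the right, so a method that ranks all true pairs first hugs the top-left corner of the plot.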
Alignment vs. Superposition
Structural alignment attempts to establish homology between two or more polymer structures based on their shape and 3D structure. It requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for comparing proteins with low sequence similarity, where evolutionary relationships cannot be easily detected by standard sequence alignment techniques.
Conversely, simple structural superposition uses knowledge of at least some equivalent residues to guide a rigid-body superposition. It is the most basic possible comparison between protein structures and makes no attempt to align the input structures. Instead, it requires a precalculated alignment as input to determine which residues are to be considered in the RMSD calculation.
Structural superposition of two CheY orthologs
In pairwise structure superposition, a correspondence set of residue pairs is established by a pairwise sequence alignment.
Pairwise structure superposition
Superposition algorithms optimize the orientation and spatial position of the two molecules with respect to each other. Superposition usually starts with a sequence comparison, which establishes the one-to-one relationships between pairs of atoms from which the RMSD is computed. This is typically a good assumption at appreciable pairwise sequence identity, but it breaks down in the Twilight Zone.
Once atom-to-atom relationships between two structures are established, the task of the algorithm is to achieve an optimal superposition with the smallest possible RMSD. It is usually impossible to achieve perfect overlap of all atom pairs, even for structures with 100% identical sequences: overlaying one pair of atoms perfectly may push another pair further apart. Also, as in sequence alignment, there is a tension between global and local matching that must be considered.
Global similarity ≠ local similarity
Global alignment
(Images and content from Patrice Koehl at UC Davis)
Global similarity ≠ local similarity
Local alignment: structural motif
(Images and content from Patrice Koehl at UC Davis)
Choosing an appropriate description of structure
Structure comparisons can be done at several different levels:
- Individual atoms (disadvantages?)
- Residue positions, which can be specified by the coordinates of Cα, Cβ, and the center of mass of the side chains. What are the advantages and disadvantages of using different residue representations?
- Small fragments
- Secondary structure elements (SSE)
Choosing an appropriate description of structure
Only when the structures to be aligned are highly similar or even identical is it meaningful to align side-chain atom positions.
-- In that case the RMSD reflects not only the conformation of the protein backbone but also the rotameric states of the side chains.
Other comparison criteria that reduce noise and bolster positive matches include:
-- Secondary structure assignment
-- Native contact maps or residue interaction patterns
-- Measures of side-chain packing
-- Measures of hydrogen bond retention
Contact map
Choosing an objective function to extremize
Structure superposition requires minimizing the error within the framework of some objective function. Which one?
- Torsion angle comparison
- Distance matrices
- Structure superposition (RMSD, TM-score, etc.): most obvious & common
- Secondary structure superposition (SHEBA)
This decision must also be made for structure alignment, since superposition is used (many times over) in that harder problem.
Torsion angles
Torsion angles (φ, ψ) are:
- local by nature
- invariant upon rotation and translation of the molecule
- compact (O(n) angles for a protein of n residues)
But… [Figure: add 1 degree to all torsion angles]
(Images and content from Patrice Koehl at UC Davis)
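To make the invariance claim concrete, here is a minimal sketch of how one torsion angle can be computed from four consecutive atom positions (for φ these would be C(i-1), N, Cα, C). The function name `dihedral` and the helper names are illustrative choices, not taken from the lecture.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees, in (-180, 180], defined by four points."""
    b0 = _sub(p0, p1)          # bond pointing back from the central bond
    b1 = _sub(p2, p1)          # central bond (the rotation axis)
    b2 = _sub(p3, p2)          # bond pointing forward from the central bond
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
    shifted = [(x + 5.0, y - 3.0, z + 2.0) for x, y, z in atoms]
    print(dihedral(*atoms), dihedral(*shifted))  # same angle before and after translation
```

Only differences between coordinates enter the calculation, which is why translating (or rotating) the whole molecule leaves every torsion angle unchanged.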
Distance matrices
Advantages
- invariant with respect to rotation and translation
- can be used to compare proteins
Disadvantages
- the distance matrix is O(n²) for a protein with n residues
- comparing distance matrices is a difficult problem
- insensitive to chirality
(Images and content from Patrice Koehl at UC Davis)
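A small sketch can illustrate both the O(n²) size and the chirality problem. It builds a distance matrix and a contact map from a list of 3D points (for example Cα coordinates). The 8 Å cutoff is a commonly used value but is an assumption here, as are the function names.

```python
import math


def distance_matrix(coords):
    """n x n matrix of pairwise Euclidean distances between 3D points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(coords, cutoff=8.0):
    """Binary matrix: 1 where two points are within `cutoff` (assumed 8 Å), else 0."""
    return [[1 if d <= cutoff else 0 for d in row] for row in distance_matrix(coords)]


if __name__ == "__main__":
    # Toy coordinates standing in for C-alpha positions (made-up numbers).
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0),
              (3.8, 3.8, 3.8), (10.0, 10.0, 10.0)]
    mirrored = [(-x, y, z) for x, y, z in coords]  # mirror image of the same chain

    dm = distance_matrix(coords)
    dm_mirror = distance_matrix(mirrored)
    same = all(abs(a - b) < 1e-12
               for row_a, row_b in zip(dm, dm_mirror)
               for a, b in zip(row_a, row_b))
    print("mirror image has the same distance matrix:", same)
    for row in contact_map(coords):
        print(row)
```

Because a structure and its mirror image produce the same matrix, a comparison based only on distances cannot tell a right-handed helix from a left-handed one.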
Scoring DM similarity (or in this case, contact map)
Scoring DM similarity (or in this case, contact map)
Introduce a gap
In superposition, gap location is defined by an alignment! In alignment, different gap positions are tried until the best overlap is identified.
Root mean squared deviation (RMSD)
The most common parameter that expresses the difference between two protein structures is the RMSD, or root mean squared deviation (distance), in atomic positions between the two structures. RMSD can be calculated over all atoms or over some subset of the atoms, such as the backbone or CA atoms. Using a subset is common because two protein structures being compared are often not identical in sequence, in which case the only atoms that can be compared one-to-one are the backbone atoms.
RMSD calculation
[Figure: distances d1 through d5 between corresponding atoms]
The two structures must first be superimposed to calculate a meaningful RMSD value, because they start out in different coordinate systems!
RMSD calculation (with a gap)
[Figure: distances d1, d2, d4 and d5 between corresponding atoms; position 3 is unpaired]
Blue: 1 – 2 – 3 – 4 – 5
Red:  1 – 2 – x – 4 – 5
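The gapped calculation can be sketched as follows. The RMSD is the square root of the mean squared distance over the correspondence set, so the unpaired blue residue 3 simply does not contribute. The coordinates are made-up toy values, the structures are assumed to be superimposed already, and the function name `rmsd` is illustrative.

```python
import math


def rmsd(coords_a, coords_b, pairs):
    """RMSD over a correspondence set.

    `pairs` is a list of (i, j) index pairs: atom i of structure A is matched
    with atom j of structure B. Unpaired atoms (gaps) are simply left out.
    """
    squared = [math.dist(coords_a[i], coords_b[j]) ** 2 for i, j in pairs]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Toy model: blue has five atoms, red lacks a partner for blue atom 3.
    blue = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
            (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
    red = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (3.0, 1.0, 0.0), (4.0, 3.0, 0.0)]
    # Blue 1-2-3-4-5 vs. red 1-2-x-4-5, written with 0-based indices.
    pairs = [(0, 0), (1, 1), (3, 2), (4, 3)]
    print("RMSD over 4 paired atoms:", rmsd(blue, red, pairs))
```

In this toy case the four distances are 1, 1, 1 and 3, so the RMSD is sqrt(12/4) = sqrt(3), while the plain average distance is 1.5: the single large deviation is weighted more heavily by the RMSD.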
RMSD vs. average D as a function of n
Estimating RMSD by averaging distances generally gets better as the correspondence set size n increases. However, RMSD can never be smaller than the average distance D; the two are equal only when all n distances are identical.
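The relation between the RMSD and the average distance follows from a one-line identity. Here d_i are the n pairwise distances and \bar{D} is their arithmetic mean; the notation is chosen for this note rather than taken from the slide.

```latex
\mathrm{RMSD}^2 - \bar{D}^2
  = \frac{1}{n}\sum_{i=1}^{n} d_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} d_i\right)^2
  = \frac{1}{n}\sum_{i=1}^{n} \left(d_i - \bar{D}\right)^2 \;\ge\; 0
```

The right-hand side is the variance of the distances, so the gap between RMSD and the average distance grows with the spread of the individual deviations and vanishes only when every pair is displaced by the same amount.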
Using RMSD to find the optimal superposition
Superposition is too complicated for manual optimization
Using RMSD to find the optimal superposition
Simplified problem (compared to structure alignment): we know the correspondence between set A and set B. We wish to compute the rigid transformation T that best aligns a_1 with b_1, a_2 with b_2, …, a_N with b_N. The error to minimize is defined above. This is an old problem, solved in Statistics, Robotics, Medical Image Analysis, etc.
(Images and content from Patrice Koehl at UC Davis)
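Using RMSD to find the optimal superposition
A rigid-body transformation T is a combination of a translation t and a rotation R, thus: T(x) = Rx + t.
The quantity to be minimized is the sum of squared distances between each transformed point of A and its partner in B: $\epsilon = \sum_{i=1}^{N} \lVert T(a_i) - b_i \rVert^2 = \sum_{i=1}^{N} \lVert R a_i + t - b_i \rVert^2$, which equals N times the squared RMSD.
The algorithm includes a fair amount of linear algebra (and a little bit of calculus) that is outside the scope of this class. Believe it or not, the algorithm is O(n)!
[Figure: representation of the 6 “trivial” DOF]
(Images and content from Patrice Koehl at UC Davis)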
Using RMSD to find the optimal superposition
Pseudocode: superposition algorithm in reality
1.) Define the error function (RMSD)
2.) Determine the correspondence set (pairwise sequence alignment)
3.) Translation = align centers of mass (COM)
4.) Rotation = use matrix methods to solve for the rotation that minimizes the error function (a variety of methods is available)
5.) Evaluate the resulting superposition
6.) Refine the superposition (because COM to COM may not be the best translation)
7.) Iterate until convergence
Back to our toy model…
1) Generate pairwise alignment
2) Find optimal superimposition
- Translation
- Rotation
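The translation and rotation steps can be sketched for a two-dimensional toy model, where the optimal rotation angle has a closed form and no matrix decomposition is needed. This is a simplification chosen for illustration: real structures are 3D, where the rotation is usually obtained with matrix methods. The function names are hypothetical, and the correspondence set is assumed to be given by the order of the two point lists.

```python
import math


def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def superpose_2d(mobile, target):
    """Least-squares rigid superposition of paired 2D points.

    Returns (fitted mobile coordinates, rotation angle in radians, RMSD).
    """
    cm, ct = centroid(mobile), centroid(target)
    # Translation: move the centroids of both paired point sets to the origin.
    a = [(x - cm[0], y - cm[1]) for x, y in mobile]
    b = [(x - ct[0], y - ct[1]) for x, y in target]
    # Rotation: the angle that maximizes the overlap sum_i b_i . (R a_i).
    s_cos = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    s_sin = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    # Apply the rotation, then shift onto the target's centroid.
    fitted = [(c * ax - s * ay + ct[0], s * ax + c * ay + ct[1]) for ax, ay in a]
    err = sum((fx - tx) ** 2 + (fy - ty) ** 2
              for (fx, fy), (tx, ty) in zip(fitted, target))
    return fitted, theta, math.sqrt(err / len(target))


if __name__ == "__main__":
    mobile = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (3.0, 3.0)]
    angle = math.radians(30.0)
    # Target = the same shape rotated by 30 degrees and shifted by (3, -2).
    target = [(math.cos(angle) * x - math.sin(angle) * y + 3.0,
               math.sin(angle) * x + math.cos(angle) * y - 2.0) for x, y in mobile]
    fitted, theta, err = superpose_2d(mobile, target)
    print("recovered angle (deg):", math.degrees(theta), " RMSD:", err)
```

The work is a handful of sums over the N paired points, which is consistent with the linear running time mentioned for the full 3D method.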
Superposition of a pair of CuZnSOD structures
Sequence identity = 83%
RMSD = 1.0 Å
Superposition of several CuZnSOD structures
Sequence identity = 68%, 35%
RMSD = 1.6 Å, 0.6 Å
Global vs. local superposition in Calmodulin
[Figure panels: ligand free; complexed with trifluoperazine]
Global vs. local superposition in Calmodulin
Global alignment: RMSD = 15 Å (143 residues)
Local alignment: RMSD = 0.9 Å (62 residues)
By itself, RMSD is not a very useful error function
For example, consider a series of fragments all generated from the blue structure…
RMSD = 0.0 Å, Aligned = 95, Z-score = 17.3
RMSD = 0.0 Å, Aligned = 101, Z-score = 18.4
RMSD = 0.0 Å, Aligned = 40, Z-score = 3.7
Up-weighting secondary structure, etc.
Based on the assumption that secondary structure elements should match up better than coil, we can easily modify the RMSD calculation to reflect that. That is, a multiplier is applied to each residue pair (x_1 for secondary structure, x_2 for coil, where x_1 > x_2) to up-weight the important stuff. For example, assuming the red dots correspond to secondary structures in the figure above, RMSD′ < RMSD, which might be expected to be a more accurate reflection of the similarity between the pair.
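One way to sketch the up-weighting is a weighted RMSD in which each squared distance is multiplied by its weight and the sum is divided by the total weight. The normalization by the sum of weights is an assumption made here (the slide's exact formula is not reproduced in these notes), and the default values x1 = 2 and x2 = 1 are arbitrary illustrative choices.

```python
import math


def weighted_rmsd(distances, in_sse, x1=2.0, x2=1.0):
    """Weighted RMSD: pairs inside secondary structure get weight x1, coil gets x2.

    Assumed form: sqrt(sum(w_i * d_i**2) / sum(w_i)); with x1 == x2 this
    reduces to the ordinary RMSD.
    """
    weights = [x1 if flag else x2 for flag in in_sse]
    total = sum(w * d * d for w, d in zip(weights, distances))
    return math.sqrt(total / sum(weights))


if __name__ == "__main__":
    # Made-up distances (in Å): three well-matched helix residues, two loose coil residues.
    distances = [0.5, 0.6, 0.4, 3.0, 2.5]
    in_sse = [True, True, True, False, False]
    print("RMSD  =", weighted_rmsd(distances, in_sse, x1=1.0, x2=1.0))
    print("RMSD' =", weighted_rmsd(distances, in_sse))
```

With these toy numbers RMSD′ comes out smaller than RMSD because the well-matched secondary structure residues dominate the average. If the secondary structure elements were the poorly matched part, the same weighting would increase the value instead.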
Template Modeling Score (TM-score)
The TM-score is a measure of similarity between two protein structures with different tertiary structures. It is intended as a more accurate measure of the quality of full-length protein structures than the often-used RMSD. The TM-score takes values in (0, 1], where 1 indicates a perfect match between two structures. Generally, scores below 0.20 correspond to randomly chosen unrelated proteins, whereas structures with a score higher than 0.5 assume roughly the same fold. The TM-score is designed to be independent of protein length.
d_0 = normalization factor
d_i = distance between the i-th residue pair
L_xxx = lengths of the target protein and of the alignment
Y. Zhang, J. Skolnick, Scoring function for automated assessment of protein structure template quality, Proteins, :
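As a rough sketch, the score for one fixed superposition can be computed from the aligned-pair distances using the published form TM = (1/L_target) Σ 1/(1 + (d_i/d_0)²) with d_0 = 1.24 (L_target − 15)^(1/3) − 1.8. The full TM-score is the maximum of this quantity over superpositions, which this sketch does not search for. The 0.5 Å floor on d_0 for short targets and the function names are assumptions made for this example.

```python
def tm_d0(l_target):
    """Length-dependent normalization distance d_0 (in Å)."""
    if l_target <= 15:
        return 0.5  # assumption: floor for very short targets, where the formula breaks down
    return max(0.5, 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, l_target):
    """TM-score of one given superposition.

    `distances` holds d_i for each aligned residue pair (length L_aligned);
    the sum is normalized by the full target length, so unaligned residues
    lower the score.
    """
    d0 = tm_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target


if __name__ == "__main__":
    print("d0 for a 100-residue target:", tm_d0(100))
    print("perfect match:", tm_score([0.0] * 100, 100))
    print("half the target aligned perfectly:", tm_score([0.0] * 50, 100))
    print("every pair at distance d0:", tm_score([tm_d0(100)] * 100, 100))
```

Each term is bounded by 1 no matter how large d_i becomes, so a few badly placed residues cannot dominate the score the way they dominate an RMSD.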
RMSD vs TM-score
RMSD: 12.1 Å, TM-score: 0.81
RMSD: 12.5 Å, TM-score: 0.22
(Images from Dr. Zhang at KU)