Seminar in structural bioinformatics Multiple structural alignment of proteins By Elad Kaspani
Multiple structural alignment
Outline
- Introduction
  - What is multiple structural alignment?
  - Why do we need multiple structural alignment?
  - Pairwise vs. multiple structural alignment
- MASS - Multiple structural alignment by secondary structures
  - Problem definition
  - General strategy
  - Algorithm description
Outline (cont.)
- MASS - Multiple structural alignment by secondary structures
  - Algorithm outline
  - Complexity
- Results and discussion
- Summary & conclusions
Introduction
- Proteins sharing a common substructure may have a similar function.
- What is multiple structural alignment?
- Discussion: we already have pairwise alignment, isn't that enough?
Pairwise vs. multiple structural alignment
- Many algorithms exist for the pairwise structural alignment task
- Only a few methods are available for aligning multiple structures
- Most of them are based on a series of pairwise comparisons:
  - SSAPm (Taylor et al., 1994)
  - Prism (Yang and Honig, 2000b)
  - STAMP (Russell and Barton, 1992)
What do we want?
- Classification of existing and newly discovered proteins
- Gaining insights into evolutionary relations between proteins
- Detecting motifs common to a group of proteins that share a certain function
- Structure prediction algorithms
What's wrong with methods based on a series of pairwise comparisons?
Multiple structural alignment
- These methods are limited!
- In each pairwise comparison, the only information available is about the two molecules
  - alignments that are optimal for the whole set can be disregarded
- Dynamic programming disadvantage: it depends on the sequence order of the polypeptide chain
- We can't see the woods for the trees
What do we do then?
- MASS: multiple structural alignment by secondary structures
MASS
- Considers all the given structures at the same time
- Exploits the secondary structure representation, which reduces the time complexity
- Does not require that all the input molecules be aligned
- Capable of detecting structural motifs shared only by a subset of the molecules
MASS
- Can find non-sequential and even non-topological structural motifs
- Suitable for a broad range of applications
- Filters noisy results
- Highly efficient and robust
- Other multiple-based methods:
  - (Escalier et al., 1988)
  - MUSTA (Leibowitz et al., 2001)
  - MultiProt (Shatsky et al., 2002)
Secondary structure elements (SSE)
Basic terms
- Rigid transformation: for a subset of points Q, T(Q) = R(Q) + t, where R is a 3x3 rotation matrix and t is a translation vector
- ε-congruent: for ε > 0, find the two largest subsets of the input sets, P and Q, and a rigid transformation T such that distance(P, T(Q)) < ε
- How do we measure distance? RMSD (root mean square deviation)
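As a minimal sketch of the two definitions above, the following Python applies a rigid transformation T(Q) = R(Q) + t to a list of points and measures the distance to a second list by RMSD. The function names (`apply_rigid`, `rmsd`, `rotation_z`, `is_eps_congruent`) and the assumption that the two lists are already paired point by point are illustrative only; they are not taken from the MASS implementation.

```python
import math

def apply_rigid(points, R, t):
    """Apply T(q) = R q + t to every 3D point in the list."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def rmsd(P, Q):
    """Root mean square deviation between two paired, equally long point lists."""
    if len(P) != len(Q) or not P:
        raise ValueError("point lists must be non-empty and of equal length")
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(P, Q))
    return math.sqrt(total / len(P))

def rotation_z(angle):
    """3x3 rotation matrix about the z axis (used only to build an example)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def is_eps_congruent(P, Q, R, t, eps):
    """True if the paired sets P and T(Q) are within eps of each other by RMSD."""
    return rmsd(P, apply_rigid(Q, R, t)) < eps

# Example: P is an exact rotated and shifted copy of Q, so the RMSD is ~0.
Q = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
R, t = rotation_z(math.pi / 2), (1.0, 2.0, 3.0)
P = apply_rigid(Q, R, t)
```

Here `is_eps_congruent(P, Q, R, t, eps=0.5)` holds because P was generated from Q by the same transformation; finding R and t when they are unknown is the hard part that the algorithm addresses.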
Problem Definition
- The pairwise case:
  - given two proteins, each represented by a set of points in 3D space
  - each point is associated with an atom's position
  - find the largest point set that is congruent to a subset of points from each of the two proteins
- In computational geometry this is the largest common point set (LCP) problem
Problem Definition
- The multiple case:
  - given a collection of m point sets,
  - find the largest set of points of which an ε-congruent copy appears in each of the input sets
- Unfortunately, it's NP-hard
- We want not only the largest set of points, but also smaller common substructures
Problem Definition
- The multiple subset case:
  - find solutions where only a subset of the input proteins is well aligned
  - this complicates the problem! (why?)
  - the number of subsets is exponential
  - there is a trade-off between the size of the subset and the size of its core (match list)
  - scoring function (L = core size, k = number of proteins): $f(L,k) = k \cdot L^{2}$
The algorithm
Method
- Input: a set of m proteins $P_1, P_2, \ldots, P_m$. For each protein:
  - the sequence of the 3D coordinates of its atoms
  - an assignment of SSE types to each residue
- Output: the multiple alignments with the largest cores, according to the scoring function
General strategy
- We want multiple alignments with at least two SSEs
- Bases: ordered pairs of SSEs whose ε-congruent copies appear in several proteins
- We look for a set of ε-congruent bases $\{b_1, b_2, \ldots, b_k\}$ from proteins $P_{i_1}, P_{i_2}, \ldots, P_{i_k}$, respectively
- The first base ($b_1$) is our pivot
General strategy (cont.)
- Compute all the k − 1 rigid transformations between the pivot base and the others
- Result: $(T_{12}, T_{13}, \ldots, T_{1k})$ defines a multiple alignment between $P_{i_1}, P_{i_2}, \ldots, P_{i_k}$
- The core may contain more than one base
  - we then get several alignments with almost the same transformations (one alignment per base in the core)
General strategy (cont.)
- Cluster the initial multiple base alignments
  - Merge the alignments in each cluster; the core of the new alignment is the union of the cores of the original alignments
  - We get a smaller set of multiple alignments
- Extend the clustered alignments
  - Find additional matching residues
- Give a score to each alignment
- Report the highest-scoring alignments
Algorithm outline
Algorithm outline - stage 1
- Representation of secondary structure elements:
  - Axis representation for SSEs
  - The axis is the least-squares line through all the Cα atoms of the SSE
  - Its direction and length are determined by the protein structure
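The axis representation can be sketched as follows. This is an illustration under stated assumptions, not the MASS code: the least-squares line is taken as the dominant principal direction of the Cα coordinates, found here by power iteration so that only the standard library is needed; the axis is oriented from the first atom towards the last, and its endpoints are the extreme projections of the atoms onto the line. The name `sse_axis` is hypothetical.

```python
import math

def sse_axis(ca_coords, iterations=100):
    """Return (start, end) of a least-squares axis through C-alpha coordinates."""
    if len(ca_coords) < 2:
        raise ValueError("need at least two C-alpha atoms")
    n = len(ca_coords)
    centroid = [sum(p[i] for p in ca_coords) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in ca_coords]
    # 3x3 scatter matrix of the centred coordinates.
    scatter = [[sum(c[i] * c[j] for c in centered) for j in range(3)]
               for i in range(3)]
    # Start from the first-to-last vector so the axis follows the chain direction.
    chain = [ca_coords[-1][i] - ca_coords[0][i] for i in range(3)]
    length = math.sqrt(sum(x * x for x in chain))
    if length == 0.0:
        raise ValueError("first and last atoms coincide")
    d = [x / length for x in chain]
    # Power iteration converges to the dominant eigenvector of the scatter matrix.
    for _ in range(iterations):
        nd = [sum(scatter[i][j] * d[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in nd))
        if norm == 0.0:
            break
        d = [x / norm for x in nd]
    # Axis endpoints: extreme projections of the atoms onto the fitted line.
    proj = [sum(c[i] * d[i] for i in range(3)) for c in centered]
    start = tuple(centroid[i] + min(proj) * d[i] for i in range(3))
    end = tuple(centroid[i] + max(proj) * d[i] for i in range(3))
    return start, end

# Example: a zigzag of five atoms running along the x axis.
zigzag = [(0.0, 1.0, 0.0), (1.0, -1.0, 0.0), (2.0, 1.0, 0.0),
          (3.0, -1.0, 0.0), (4.0, 1.0, 0.0)]
axis_start, axis_end = sse_axis(zigzag)
```

For the zigzag the fitted axis runs along x through the centroid, which is the behaviour one would want for a strand-like element.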
Algorithm outline – stage 2
- Detection of multiple base alignments:
  - Use Geometric Hashing to detect bases whose ε-congruent copies appear in several proteins
  - Each base has a fingerprint that is invariant to 3D rigid transformations:
    - the types of the two SSEs
    - the angle between their axial vectors
    - the midpoint-to-midpoint distance
    - their line distance
Base fingerprint
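A rough sketch of how such a fingerprint could be computed from two SSE axes is given below. Each SSE is assumed to be a tuple `(type, start, end)` of an SSE type label and its axis endpoints, and the "line distance" is taken here to be the shortest distance between the two infinite axis lines; both choices are assumptions made for illustration and may differ from the actual MASS definitions.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def base_fingerprint(sse_a, sse_b):
    """Fingerprint of an ordered SSE pair: (types, angle, midpoint distance, line distance)."""
    type_a, a0, a1 = sse_a
    type_b, b0, b1 = sse_b
    u, v = _sub(a1, a0), _sub(b1, b0)
    # Angle between the axial vectors, clamped against rounding error.
    cos_angle = _dot(u, v) / (_norm(u) * _norm(v))
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    mid_a = tuple((x + y) / 2 for x, y in zip(a0, a1))
    mid_b = tuple((x + y) / 2 for x, y in zip(b0, b1))
    mid_dist = _norm(_sub(mid_a, mid_b))
    # Shortest distance between the two axis lines.
    n, w = _cross(u, v), _sub(b0, a0)
    if _norm(n) < 1e-9:                      # parallel axes
        line_dist = _norm(_cross(w, u)) / _norm(u)
    else:
        line_dist = abs(_dot(w, n)) / _norm(n)
    return ((type_a, type_b), angle, mid_dist, line_dist)

# Example: a helix along x and a strand along y, 5 Å above it.
helix = ("H", (0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
strand = ("E", (0.0, 0.0, 5.0), (0.0, 10.0, 5.0))
fp = base_fingerprint(helix, strand)
```

All four components depend only on relative geometry, so applying the same rotation and translation to both SSEs leaves the fingerprint unchanged.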
Algorithm outline – stage 2
- Almost-congruent bases have similar fingerprints:
  - the types of their SSEs are the same
  - their midpoint-to-midpoint distances and their line distances each differ by at most 1.5 Å
  - their angles differ by at most 0.3 radians
- Such bases reside close to each other in the grid
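The similarity test on this slide can be written as a short predicate. The thresholds 1.5 Å and 0.3 radians come from the slide; the tuple layout `(types, angle, midpoint_distance, line_distance)` and the name `fingerprints_match` are assumptions made for this sketch.

```python
def fingerprints_match(fp_a, fp_b, dist_tol=1.5, angle_tol=0.3):
    """True if two base fingerprints are similar enough to be almost-congruent."""
    types_a, angle_a, mid_a, line_a = fp_a
    types_b, angle_b, mid_b, line_b = fp_b
    return (types_a == types_b
            and abs(angle_a - angle_b) <= angle_tol   # radians
            and abs(mid_a - mid_b) <= dist_tol        # Å
            and abs(line_a - line_b) <= dist_tol)     # Å

# Example fingerprints: (SSE types, angle, midpoint distance, line distance).
fp1 = (("H", "E"), 1.50, 8.0, 5.0)
fp2 = (("H", "E"), 1.70, 9.0, 4.0)   # within every tolerance of fp1
fp3 = (("H", "H"), 1.50, 8.0, 5.0)   # different SSE types
fp4 = (("H", "E"), 1.95, 8.0, 5.0)   # angle differs by 0.45 rad
```

In this example only `fp2` matches `fp1`; `fp3` fails on the SSE types and `fp4` on the angle.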
Algorithm outline – stage 2
- For each grid bin, extract all the bases of the bin and of its adjacent bins
- Group them together in the same base bucket
- Base bucket: stores bases in columns according to the protein they belong to
  - Bases derived from the same protein are stored in the same column
Base bucket Almost-congruent bases are stored in the same base bucket
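The grid and base-bucket construction can be sketched as below. This is a simplified illustration: the bin widths are assumed to equal the fingerprint tolerances (1.5 Å and 0.3 radians), the SSE types are part of the grid key, and a bucket is kept only if it has at least two columns. The names `grid_key` and `build_base_buckets` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def grid_key(fp, dist_step=1.5, angle_step=0.3):
    """Map a fingerprint (types, angle, mid_dist, line_dist) to a grid bin."""
    types, angle, mid_dist, line_dist = fp
    return (types,
            math.floor(angle / angle_step),
            math.floor(mid_dist / dist_step),
            math.floor(line_dist / dist_step))

def build_base_buckets(bases):
    """bases: iterable of (protein_id, base_id, fingerprint).

    Returns {bin: {protein_id: [base_id, ...]}}; each inner dict is one
    base bucket whose columns are indexed by protein.
    """
    grid = defaultdict(list)
    for protein_id, base_id, fp in bases:
        grid[grid_key(fp)].append((protein_id, base_id))
    buckets = {}
    for key in grid:
        types, ia, im, il = key
        columns = defaultdict(list)
        # Collect the bases of this bin and of all adjacent bins.
        for da, dm, dl in product((-1, 0, 1), repeat=3):
            neighbour = (types, ia + da, im + dm, il + dl)
            for protein_id, base_id in grid.get(neighbour, ()):
                columns[protein_id].append(base_id)
        if len(columns) >= 2:        # an alignment needs at least two proteins
            buckets[key] = dict(columns)
    return buckets

# Example: two similar helix-strand bases and one unrelated helix-helix base.
bases = [
    ("P1", "b1", (("H", "E"), 1.40, 8.0, 5.0)),
    ("P2", "b2", (("H", "E"), 1.45, 8.2, 5.1)),
    ("P3", "b3", (("H", "H"), 1.40, 8.0, 5.0)),
]
buckets = build_base_buckets(bases)
```

Looking into adjacent bins means that two similar bases are grouped even when they fall on opposite sides of a bin boundary, at the cost of the same base appearing in several buckets.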
Stage 2 (cont.)
- A collection of almost-congruent bases, each belonging to a different column, induces a local multiple alignment between the respective proteins
  - its core consists of at least two SSEs
- One base is selected as a pivot
  - the rest of the bases are superimposed on it
- The selection of the pivot may influence the alignment
  - Optional: try each base as the pivot
Stage 2 (cont.)
- A multiple alignment is defined by an underlying set of pairwise alignments
- For each base bucket we compute all the alignments between two bases taken from two different columns
  - find the transformation between the two bases that aligns the maximal number of atoms with minimal RMSD
Cα atomic level
Stage 3 - Clustering
- For a pair of proteins that share more than one base:
  - we get several alignments with almost the same transformation, but a different local SSE core
- Cluster all the local base alignments to find the ones with similar transformations
  - merge them into a new global alignment
- The match list (core) of the new global alignment:
  - is the union of the original local match lists
  - its transformation is the one that aligns the SSEs with minimal RMSD
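One possible way to cluster and merge local alignments is sketched here. It is a deliberately simple, greedy single-pass scheme and not the clustering used by MASS: two transformations are called similar when the images of a few reference points lie within `max_dist` of each other by RMSD (the 2.0 Å default is an arbitrary assumption), and the merged alignment keeps the first transformation of its cluster instead of recomputing the minimal-RMSD one.

```python
import math

def transform_point(T, p):
    """Apply T = (R, t) to a single 3D point."""
    R, t = T
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def transform_distance(T1, T2, ref_points):
    """RMSD between the images of the reference points under T1 and T2."""
    sq = sum(math.dist(transform_point(T1, p), transform_point(T2, p)) ** 2
             for p in ref_points)
    return math.sqrt(sq / len(ref_points))

def cluster_alignments(alignments, ref_points, max_dist=2.0):
    """alignments: list of (T, core). Returns merged (T, core) pairs.

    The core of each merged alignment is the union of its members' cores.
    """
    clusters = []                      # each entry: [representative T, merged core]
    for T, core in alignments:
        for cluster in clusters:
            if transform_distance(cluster[0], T, ref_points) <= max_dist:
                cluster[1] |= set(core)
                break
        else:
            clusters.append([T, set(core)])
    return [(T, core) for T, core in clusters]

# Example: two nearly identical transformations and one distant one.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T1 = (I3, (0.0, 0.0, 0.0))
T2 = (I3, (0.5, 0.0, 0.0))
T3 = (I3, (10.0, 0.0, 0.0))
refs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
merged = cluster_alignments(
    [(T1, {"H1", "H2"}), (T2, {"H2", "E1"}), (T3, {"H5", "H6"})], refs)
```

In the example the first two alignments merge into one with core {H1, H2, E1}, while the third stays on its own.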
Stage 4 - Global extension
- Now the core of each pairwise alignment is a set of SSEs
- We then extend these alignments by finding additional matching residues
  - These residues do not necessarily belong to SSEs
- We want to extend the cores of these alignments by detecting corresponding Cα atoms
- We want to transform the second protein so that it is fully superimposed onto the pivot protein
Stage 4 - Global extension
- Detect, in linear time, close pairs of Cα atoms, one atom from each protein
- These atom pairs are added to the alignment's match list
- The transformation of the alignment is refined by the Least-Squares Fitting method
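The close-pair detection can be done in roughly linear time with a spatial hash, as in the sketch below. It assumes the second protein has already been superimposed onto the pivot, uses a 3.0 Å cutoff chosen only for illustration, and pairs each moved atom with its nearest pivot atom without enforcing a one-to-one match; none of these details are taken from the MASS paper.

```python
import math
from collections import defaultdict
from itertools import product

def close_ca_pairs(pivot_ca, moved_ca, cutoff=3.0):
    """Return index pairs (i, j) with |pivot_ca[i] - moved_ca[j]| <= cutoff.

    Pivot atoms are hashed into cubic cells of side `cutoff`, so each moved
    atom only inspects the 27 cells around it. With a bounded number of
    atoms per cell, the total work is linear in the number of atoms.
    """
    def cell(p):
        return tuple(math.floor(c / cutoff) for c in p)

    grid = defaultdict(list)
    for i, p in enumerate(pivot_ca):
        grid[cell(p)].append(i)

    pairs = []
    for j, q in enumerate(moved_ca):
        cx, cy, cz = cell(q)
        best = None
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for i in grid.get((cx + dx, cy + dy, cz + dz), ()):
                d = math.dist(pivot_ca[i], q)
                if d <= cutoff and (best is None or d < best[0]):
                    best = (d, i)
        if best is not None:
            pairs.append((best[1], j))
    return pairs

# Example: two of the three moved atoms have a close partner in the pivot.
pivot = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
moved = [(0.5, 0.0, 0.0), (4.0, 0.5, 0.0), (20.0, 0.0, 0.0)]
matches = close_ca_pairs(pivot, moved)
```

The resulting pairs would then be appended to the match list and fed to a least-squares fit to refine the transformation.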
Stage 5 – Filtering & Scoring
- Computing the best global multiple alignments
- What are the best global multiple alignments?
  - number of aligned molecules vs. core size
  - core size vs. size of the smallest molecule
- The number of possible multiple alignments defined by the base buckets is exponential
  - We do not compute all of them
Stage 5 – Filtering & Scoring
- Heuristic solution:
  - For each base bucket (BB), compute the set of best multiple alignments recursively over the columns
  - For a multiple base alignment $(b_1, \ldots, b_k)$ obtained at the previous step, check whether there is a base $b_{k+1}$ in the current column that improves the alignment's score
  - $\mathrm{Core}(b_1, \ldots, b_{k+1}) = \mathrm{Core}(b_1, \ldots, b_k) \cap \mathrm{Core}(b_1, b_{k+1})$
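The column-by-column extension can be illustrated with a small greedy sketch. Cores are modelled as sets of pivot elements, so the core update is a plain set intersection as in the formula above. The scoring function is passed in as a parameter; the `k * L**2` used in the example is an assumed placeholder, and treating a lone pivot as scoring 0 is also an assumption of this sketch. The name `grow_alignment` is hypothetical.

```python
def grow_alignment(pivot_core, columns, score):
    """Greedily extend a pivot-based alignment over the bucket columns.

    pivot_core: set of pivot elements (e.g. residue or SSE ids).
    columns:    {protein_id: {base_id: set of pivot elements matched in
                 the pairwise alignment Core(b_1, b)}}.
    score:      function (core_size, number_of_proteins) -> number.
    """
    core = set(pivot_core)
    members = []      # accepted (protein_id, base_id) pairs, pivot excluded
    best = 0.0        # a lone pivot is not an alignment; it scores 0 here
    for protein_id, candidates in columns.items():
        pick = None
        for base_id, pair_core in candidates.items():
            # Core(b_1..b_{k+1}) = Core(b_1..b_k) ∩ Core(b_1, b_{k+1})
            new_core = core & pair_core
            s = score(len(new_core), len(members) + 2)
            if s > best and (pick is None or s > pick[0]):
                pick = (s, base_id, new_core)
        if pick is not None:          # accept only if the score improves
            best, base_id, core = pick
            members.append((protein_id, base_id))
    return members, core, best

# Example with an assumed score of the form k * L**2.
columns = {
    "P2": {"b2": {1, 2, 3, 4, 5}},
    "P3": {"b3": {1, 2, 3, 4, 5, 6}, "b3x": {1, 2}},
    "P4": {"b4": {1}},
}
members, core, best = grow_alignment({1, 2, 3, 4, 5, 6}, columns,
                                     lambda L, k: k * L ** 2)
```

In the example P2 and P3 are accepted with a shared core of five elements, while P4 is left out because adding it would shrink the core to a single element and lower the score.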
Stage 5 – Filtering & Scoring
- Our scoring function:
  - L: core size
  - k: number of proteins
  - $f(L,k) = k \cdot L^{2}$
- Report the highest-scoring alignments
- Finish!
Complexity
- Worst-case complexity, where:
  - (i) m is the number of proteins
  - (ii) k is the number of residues in an SSE
  - (iii) s and n are the number of SSEs and the number of residues in each protein, respectively
- Typical values: n ~ 300, k ~ 10, s ~ 15
- The number of bases for each protein is $O(s^2)$
Complexity
- For each pair of proteins we construct, cluster and extend $O(s^4)$ pairwise alignments
- This results in $O(m^2 (s^4 k^3 + s^8 \log s + s^4 n))$ time, where $O(m^2)$ is the number of ways of pairing two proteins
- In practice, the complexity is much smaller:
  - we only construct the pairwise alignments defined by the BBs
  - the clustering reduces their number even more
Complexity
- The number of evaluated multiple alignments is linear in the number of bases
  - Each base can be a pivot for only one multiple alignment
  - We have $O(m s^2)$ bases
- It takes $O(m s^2 n)$ time to construct a single multiple alignment and $O(m^2 s^4 n)$ time to construct all of them
- The running time of the entire algorithm is bounded by $O(m^2 s^4 (k^3 + s^2 \log s + n))$, but experiments show that the actual running time is significantly lower
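To get a feel for these bounds, the quantities above can be evaluated for the typical values quoted earlier (s ~ 15, n ~ 300). The sketch below simply plugs numbers into the stated expressions and ignores all constant factors; m = 12 is borrowed from the ensemble size of Experiment 1, and the function name is hypothetical.

```python
def worst_case_counts(m, s, n):
    """Evaluate the worst-case expressions from the slides, constants ignored."""
    return {
        "bases_per_protein": s ** 2,                  # O(s^2)
        "total_bases": m * s ** 2,                    # O(m s^2)
        "pairwise_alignments_per_pair": s ** 4,       # O(s^4)
        "one_multiple_alignment": m * s ** 2 * n,     # O(m s^2 n)
        "all_multiple_alignments": m ** 2 * s ** 4 * n,  # O(m^2 s^4 n)
    }

counts = worst_case_counts(m=12, s=15, n=300)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

With these values there are 225 bases per protein and about 2.2 billion elementary steps in the worst case for constructing all multiple alignments, which is why the reduction obtained from base buckets and clustering matters in practice.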
Algorithm outline (reminder)
Results and Discussion
Experiment 1
- Example 1: detection of subset alignments and their use for structural classification
- We used MASS to align a set of 12 structures from two families:
  - Cofilin-like (CL)
  - Gelsolin-like (GL)
- The two families are related in structure but not in sequence
Experiment 1
- The 12-molecule ensemble contains:
  - four CL structures
  - eight GL structures
- The running time of MASS on this ensemble was 36 sec (Pentium MHz processor)
Experiment 1: core Vs. # Molecules
Experiment 1: Results (A) The structural alignment of all 12 proteins of the ensemble. (B) A subset alignment between only the eight GL proteins.
Experiment 1: Results (C) A subset alignment between only the four CL structures. (D) A subset alignment between only three out of the four CL structures.
Results Discussion
- As expected, the maximal core size decreases as the number of aligned molecules increases
- The dependence is not linear; there are large decreases:
  - between three and four molecules
  - between four and five molecules
  - between eight and nine molecules
Experiment 2
- Non-topological motif detection
- The ensembles share a common SSE motif, but have different topologies
- In topological motifs, the order and the direction of the corresponding SSEs along the polypeptide chain are conserved, while in non-topological motifs they are not
Experiment 2
- Helix bundle ensemble: the ten proteins in this ensemble belong to four different folds and six different superfamilies
- Running time: 48 seconds
- Also aligned by MUSTA
- MASS detected two additional conserved α-helices (why?)
  - MASS is secondary structure oriented
  - It is directed to find solutions that contain more SSEs
Common core is shown by assigning a different color to each conserved helix.
The schematic TOPS representation Triangles represent strands and circles helices. Corresponding secondary structure regions are drawn in the same color. As one can see the solution is non-topological.
Large-scale structural alignments
- MASS can be applied to tens of proteins in practical running times on a standard PC
- Three SCOP ensembles:
  - (i) Serine proteases: all structures from the 'Prokaryotic trypsin-like serine protease' SCOP family (68 molecules)
  - (ii) PK beta barrel: all structures from the 'Pyruvate kinase beta-barrel domain' SCOP family (66 molecules)
  - (iii) Unrelated proteins (80 proteins)
Large-scale structural alignments
Results Discussion
- The results show that the running time is influenced by:
  - (i) the number of molecules
  - (ii) the average molecular size (and the average number of SSEs in a molecule)
  - (iii) the structural variance among the molecules
- How do you think they influence the running time?
Results Discussion
- The first two parameters behave as expected: the running time increases as they grow
- The more structurally variable the ensemble, the shorter the running time (!)
- Why?
  - The more structurally homogeneous the input, the more SSE bases are stored in the same grid bin, and so the more candidate alignments have to be computed
Summary & Conclusions - MASS
- Pairwise comparisons are not enough
- MASS is a novel method for aligning multiple protein structures and detecting their shared core
- It can also detect cores common only to subsets of the input proteins
- It exploits a secondary structure representation of proteins
- Many noisy solutions are filtered out
- Highly efficient: capable of aligning tens of protein molecules
Summary & Conclusions - MASS
- Disregards the sequence order of SSEs along the polypeptide chain
  - Can find non-sequential and non-topological structural motifs
  - An advantage over dynamic-programming based methods
- The MASS program can be run in two modes:
  - (i) using SSE information only (reduced running time) (when will we use this option?)
  - (ii) using both SSE and atomic information
Questions?
Lecture was based on article: MASS: multiple structural alignment by secondary structures By O. Dror, H. Benyamini, R. Nussinov, and H. Wolfson (School of Computer Science, Tel Aviv University, 2003 )