
A Novel Geometric Build-Up Algorithm




1 A Novel Geometric Build-Up Algorithm for Solving the Distance Geometry Problem and Its Application to Multidimensional Scaling
Zhijun Wu, Department of Mathematics, Program on Bio-Informatics and Computational Biology, Iowa State University. Joint work with Tauqir Bibi, Feng Cui, Qunfeng Dong, Peter Vedell, Di Wu. I would like to thank the organizers for inviting me to attend the workshop and give a talk on my recent work related to multidimensional scaling. My name is Zhijun Wu. I am a joint faculty member of the ISU Department of Mathematics and the Program on Bioinformatics and Computational Biology. The work I am going to talk about was done jointly with my colleagues and students at ISU.

2 Distance Geometry, Multidimensional Scaling, Molecular Conformation
Fundamental problem: find the coordinates for a set of points, given the distances for all pairs of points. Distance Geometry: mapping from semi-metric to metric spaces; Euclidean and non-Euclidean; Cayley-Menger determinant; necessary and sufficient conditions of embedding. Multidimensional Scaling: data classification; geometric mapping of data; singular-value decomposition method; strain/stress minimization. Molecular Conformation: embedding in 3D Euclidean space; protein structure prediction and determination; sparse, inexact distances, bounds on the distances, probability distributions. Distance geometry, multidimensional scaling, and molecular conformation share a common interest in solving the same problem: finding the coordinates for a set of points given the distances for all pairs of points. Distance geometry studies more theoretical issues such as mapping from semi-metric to metric spaces, necessary and sufficient conditions of embeddability, etc. Multidimensional scaling is more concerned with data analysis and algorithm development, such as data classification, the SVD algorithm for multidimensional scaling, etc. Molecular conformation is related to the determination of molecular structure using inter-atomic distance information and has an extremely important application in biology for protein structure prediction and determination. The problems in molecular conformation have more complicated distance data than what the typical distance geometry or multidimensional scaling fields have covered, and therefore present more challenges to the conventional approaches. My talk will concentrate on molecular conformation problems and their solutions. The methods should be extendable to other similar distance geometry and multidimensional scaling applications as well.

3 HIV Retrotranscriptase
Proteins are building blocks of life and key ingredients of biological processes. A biological system may have up to hundreds of thousands of different proteins, each with a specific role in the system. A protein is formed by a polypeptide chain with typically several hundreds of amino acids and tens of thousands of atoms. A protein has a unique 3D structure, which determines in many ways the function of the protein. An example: HIV Retrotranscriptase, 554 amino acids, 4200 atoms. First, let me give you a little background about protein structure determination and explain why it is such a big deal in biology. Proteins are basic elements of biological systems and key to the understanding of biological processes. The function of a protein, in turn, is often determined by its unique 3D structure. Therefore, the determination of the 3D structure of the protein is essential. Now let's see an example. Shown in this picture is the 3D structure of the protein HIV retrotranscriptase. The protein has 554 amino acids and 4200 atoms. It is believed to be responsible for helping the HIV virus invade normal cells. If we can find the 3D structure of the protein, we will be able to learn more about how the protein binds the virus and how it functions, and then we can develop a drug that attaches to its important sites to destroy its functions. One of the approaches to structure determination is to use distance information between pairs of atoms to find the coordinates of the atoms and therefore the whole structure of the protein. The distance information can be obtained from our knowledge of bond lengths and bond angles, NMR experiments, or structure databases. The mathematical problem to be solved is basically a distance geometry or multidimensional scaling problem, possibly with more complicated data types.

4 Molecular Distance Geometry Problem
Given n atoms a1, …, an and a set of distances di,j between ai and aj, (i,j) in S, find the coordinates x1, …, xn of the atoms such that ||xi - xj|| = di,j for all (i,j) in S. We call this problem the molecular distance geometry problem. The basic form is given in this definition, but the problem may take a different form depending on its data type, and its difficulty will vary accordingly.

5 Problems and Complexity
Problems with all distances: solvable in O(n^3) using SVD. Problems with sparse sets of distances: NP-complete (Saxe 1979). Problems with distance ranges (NMR results): NP-complete (Moré and Wu 1997), if the ranges are small. Problems with probability distributions of distances: stochastic multidimensional scaling, structure prediction. Here are the different types of problems and their possible difficulties. If the distances for all pairs of atoms are given, the problem can be solved by using an SVD method in on the order of n cubed floating-point operations. However, if the distance data is sparse, the problem is proved to be NP-hard in general. In NMR experiments, lower and upper bounds on the distances are all we can obtain. The problem then becomes finding the coordinates so that the distances are within the given bounds. We can prove that this problem is also NP-hard if the gaps between the lower and upper bounds are smaller than certain values. The most interesting class of problems is when the probability distributions of the distances are provided, for example based on statistical analysis of a structure database. We call such problems the stochastic distance geometry or multidimensional scaling problems.
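For the case where all distances are given, the decomposition-based solution is usually written as follows in the multidimensional scaling literature. This is a sketch in standard notation; the symbols D^(2), J, G, U, and Λ do not appear on the slides and are introduced here only for illustration.

```latex
D^{(2)} = \bigl[d_{i,j}^{2}\bigr]_{i,j=1}^{n}, \qquad
J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{T}, \qquad
G = -\tfrac{1}{2}\, J\, D^{(2)} J ,
\\[4pt]
G = U \Lambda U^{T}, \qquad
X = U_{3}\, \Lambda_{3}^{1/2} .
```

Here G is the Gram matrix of the centered coordinates, and U_3, Λ_3 hold the three largest eigenvalues of G with their eigenvectors. The rows of X are coordinates of the n points, determined up to a rigid motion and reflection. Decomposing the n × n matrix G is what leads to the cubic cost quoted above.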

6 Current Approaches
Embed Algorithm by Crippen and Havel; CNS Partial Metrization by Brünger et al.; Graph Reduction by Hendrickson; Alternating Projection by Glunt and Hayden; Global Optimization by Moré and Wu; Multidimensional Scaling by Trosset et al. Several approaches to the molecular distance geometry problems have been studied. The most popular one is the embed algorithm, which has been used widely in NMR structure determination through its implementation in modeling software such as Xplor and CNS.

7 Embed Algorithm
Bound smoothing: keep distances consistent. Distance metrization: estimate the missing distances. Repeat (say 1000 times): randomly generate D between L and U; find X using SVD with D; if X is found, stop. Select the best approximation X. Refine X with simulated annealing. Final optimization. Slide annotations: "time consuming, O(n^3) to O(n^4)" and "costly, O(n^2) to O(n^3)". Here is an outline of the embed algorithm. The algorithm takes a set of lower and upper bounds on the distances and tries to find a set of coordinates for the atoms that satisfy the distance constraints. The first and second steps can be very time consuming, requiring on the order of n cubed to n to the fourth power floating-point operations. The third step can be costly too, since it may repeat many times. Yet the algorithm cannot always guarantee an accurate solution to the problem. Crippen and Havel 1988 (DGII, DGEOM); Brünger et al. 1992, 1998 (XPLOR, CNS).
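The bound smoothing step of the outline is commonly carried out with the triangle inequality. The following is a minimal Python sketch under that assumption, for symmetric n × n lists of lower and upper bounds. The function name `smooth_bounds` is illustrative and is not taken from DGII, XPLOR, or CNS. The triple loops also show where a cubic cost comes from.

```python
def smooth_bounds(lower, upper):
    """Tighten symmetric lower/upper distance bounds with the triangle inequality.

    Illustrative sketch only; returns new (lower, upper) matrices.
    """
    n = len(upper)
    lo = [row[:] for row in lower]
    up = [row[:] for row in upper]
    # Upper bounds: d_ij <= d_ik + d_kj, so u_ij can be lowered to u_ik + u_kj.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if up[i][k] + up[k][j] < up[i][j]:
                    up[i][j] = up[i][k] + up[k][j]
    # Lower bounds: d_ij >= d_ik - d_kj, so l_ij can be raised to l_ik - u_kj.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                cand = max(lo[i][k] - up[k][j], lo[k][j] - up[i][k])
                if cand > lo[i][j]:
                    lo[i][j] = cand
    return lo, up


# Three atoms: the loose bounds on pair (0, 1) are tightened through atom 2.
lower = [[0, 0, 5], [0, 0, 0], [5, 0, 0]]
upper = [[0, 10, 6], [10, 0, 1], [6, 1, 0]]
lo, up = smooth_bounds(lower, upper)
```

In this example the upper bound on pair (0, 1) drops from 10 to 6 + 1, and its lower bound rises from 0 to 5 - 1.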

8 Geometric Build-Up
Independent Points: A set of k+1 points in R^k is called independent if it is not a set of points in R^(k-1). Metric Basis: A set of points B in a space S is a metric basis of S provided each point of S is uniquely determined by its distances from the points in B. Fundamental Theorem: Any k+1 independent points in R^k form a metric basis for R^k. In order to improve the solutions to the molecular distance geometry problems, we have developed the so-called geometric build-up algorithm. The idea comes from a simple fact stated in standard distance geometry theory. In distance geometry, we define a set of k+1 points in R^k as a set of independent points of R^k if it is not a set of points in R^(k-1). We also define a set of points B in a space S to be a metric basis of S if each point of S is uniquely determined by its distances from the points in B. Then there is a fundamental theorem of distance geometry: any k+1 independent points in R^k form a metric basis for R^k. Blumenthal 1953: Theory and Applications of Distance Geometry.
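For k = 3, independence of four points means that they do not lie in a common plane, which can be checked through the volume of the tetrahedron they span. A minimal Python sketch follows; the function names and the tolerance `tol` are illustrative assumptions, not part of the talk.

```python
def signed_volume6(p0, p1, p2, p3):
    """Six times the signed volume of the tetrahedron p0 p1 p2 p3."""
    a = [p1[i] - p0[i] for i in range(3)]
    b = [p2[i] - p0[i] for i in range(3)]
    c = [p3[i] - p0[i] for i in range(3)]
    # Scalar triple product a . (b x c), expanded along the first row.
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def is_independent(p0, p1, p2, p3, tol=1e-9):
    """True if the four points in R^3 are not (nearly) coplanar."""
    return abs(signed_volume6(p0, p1, p2, p3)) > tol


corner = is_independent((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
flat = is_independent((0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0))
```

The first set of points spans a tetrahedron and can serve as a metric basis; the second lies in the plane w = 0 and cannot.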

9 Geometric Build-Up in two dimensions
This is actually a well-known fact in two-dimensional Euclidean space. Any three points not on a line form a set of independent points, and any point in the space can be determined by its distances from the three points.

10 Geometric Build-Up in three dimensions
Similarly, in three dimensional space, if we have four atoms located at these positions and if they are not in the same plane, then the location of any atom can be determined uniquely by its distances from these four basis atoms.

11 Geometric Build-Up in three dimensions
In other words, if we know the distances from this atom to the four basis atoms, we can also uniquely determine the position of this atom. And we can continue the same process to find the positions for all the atoms in the molecule, and we then find the structure of the molecule.

12 Geometric Build-Up
Basis atoms 1, 2, 3, 4 with known coordinates: x1 = (u1, v1, w1), x2 = (u2, v2, w2), and likewise x3 and x4. Unknown atom i, xi = (ui, vi, wi): ||xi - x1|| = di,1, ||xi - x2|| = di,2, ||xi - x3|| = di,3, ||xi - x4|| = di,4. Unknown atom j, xj = (uj, vj, wj): ||xj - x1|| = dj,1, ||xj - x2|| = dj,2, ||xj - x3|| = dj,3, ||xj - x4|| = dj,4. Algebraically, for every atom to be determined, we need to solve a system of four nonlinear equations to find the values of the three coordinates of the atom. The system can be further reduced to a linear system. The solution requires only a small constant time. So if the whole molecule has n atoms, the whole build-up process will take only on the order of n floating-point operations.
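The reduction to a linear system can be sketched as follows. Squaring the four equations and subtracting the first from the other three cancels the quadratic term and leaves 2 (xk - x1) · xi = ||xk||^2 - ||x1||^2 - di,k^2 + di,1^2 for k = 2, 3, 4. The Python below is a minimal sketch of that step; the names `locate_atom` and `det3`, and the use of Cramer's rule, are illustrative choices rather than the authors' implementation.

```python
import math


def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def locate_atom(basis, dists):
    """Coordinates of an atom from its distances to four basis atoms."""
    x1, d1 = basis[0], dists[0]
    n1 = sum(c * c for c in x1)
    A, b = [], []
    for xk, dk in zip(basis[1:4], dists[1:4]):
        # Row of the linear system obtained by subtracting equation 1.
        A.append([2.0 * (xk[i] - x1[i]) for i in range(3)])
        b.append(sum(c * c for c in xk) - n1 - dk * dk + d1 * d1)
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("basis atoms are (nearly) coplanar")
    sol = []
    for col in range(3):
        # Cramer's rule: replace one column of A by the right-hand side.
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = b[r]
        sol.append(det3(M) / d)
    return tuple(sol)


basis = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
target = (0.3, -1.2, 2.5)
dists = [math.dist(target, xk) for xk in basis]
found = locate_atom(basis, dists)
```

Each call does a fixed amount of arithmetic regardless of the size of the molecule, which is the constant per-atom cost mentioned in the narration.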

13 The geometric build-up algorithm solves a molecular distance geometry problem in O(n) when distances between all pairs of atoms are given, while the singular value decomposition algorithm requires O(n^2) to O(n^3) computing time! If we have the distances for all pairs of atoms, we can always find four basis atoms and fix their positions. For any other atom we then always have its distances from the basis atoms, so we can use the build-up method to find its coordinates. So the molecular distance geometry problem can be solved in linear time if all distances are given. Recall that the solution of this problem using SVD requires on the order of n squared to n cubed floating-point operations. So the geometric build-up algorithm is clearly more efficient.
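Fixing the positions of the four basis atoms from their six mutual distances can be sketched as below. The sketch assumes one particular convention: the first atom at the origin, the second on the u-axis, the third in the u-v plane, and the fourth with positive w (the mirror image is equally valid). The function name `place_basis` and this choice of frame are illustrative assumptions.

```python
import math


def place_basis(d):
    """Coordinates for atoms 0..3 from their 4 x 4 distance matrix d."""
    d01, d02, d03 = d[0][1], d[0][2], d[0][3]
    d12, d13, d23 = d[1][2], d[1][3], d[2][3]
    if d01 <= 0.0:
        raise ValueError("atoms 0 and 1 coincide")
    # Atom 2 in the u-v plane, from its distances to atoms 0 and 1.
    u2 = (d02 ** 2 - d12 ** 2 + d01 ** 2) / (2.0 * d01)
    v2 = math.sqrt(max(d02 ** 2 - u2 ** 2, 0.0))
    if v2 < 1e-12:
        raise ValueError("atoms 0, 1, 2 are (nearly) collinear")
    # Atom 3 from its distances to atoms 0, 1 and 2.
    u3 = (d03 ** 2 - d13 ** 2 + d01 ** 2) / (2.0 * d01)
    v3 = (d03 ** 2 - d23 ** 2 + u2 ** 2 + v2 ** 2 - 2.0 * u2 * u3) / (2.0 * v2)
    w3 = math.sqrt(max(d03 ** 2 - u3 ** 2 - v3 ** 2, 0.0))
    if w3 < 1e-12:
        raise ValueError("the four atoms are (nearly) coplanar")
    return [(0.0, 0.0, 0.0), (d01, 0.0, 0.0), (u2, v2, 0.0), (u3, v3, w3)]


pts = [(0.3, -1.0, 2.0), (1.5, 0.2, 0.1), (-0.7, 0.9, 1.1), (2.0, 2.0, -1.0)]
d = [[math.dist(p, q) for q in pts] for p in pts]
X = place_basis(d)
```

The returned coordinates differ from `pts` by a rigid motion (and possibly a reflection) but reproduce all six distances. Once the basis is placed, every remaining atom needs only its four distances to the basis, which gives the linear total cost.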

14 The X-ray crystallography structure (left) of the HIV-1 RT p66 protein (4200 atoms) and the structure (right) determined by the geometric build-up algorithm using the distances for all pairs of atoms in the protein. The algorithm took only 188,859 floating-point operations to obtain the structure, while a conventional singular-value decomposition algorithm required 1,268,200,000 floating-point operations. The RMSD of the two structures is ~10^-4 Å. Here are some computational results. Shown in the pictures are the structures of the HIV-1 RT p66 protein. The left one is the X-ray crystal structure of the protein. The right one is the structure determined by the geometric build-up algorithm using all distances between pairs of atoms. The RMSD between the two structures is about 10 to the minus 4 Å, so they are very close. The algorithm found the structure in about 200 thousand flops, while a conventional SVD algorithm requires about 1.3 billion flops.

15 Problems with Sparse Sets of Distances
The build-up algorithm can be easily extended to problems with sparse sets of distances. Of course, in this case, the distances from one set of basis atoms to any atom are not always available any more. So we just examine every atom to see if there are four determined atoms that can serve as its basis atoms. If so, the atom can be determined immediately. The process repeats until all atoms are determined.
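The repeated scan over undetermined atoms can be sketched as follows. This Python sketch tracks only which atoms become determinable and in what order; it omits the coordinate computation and the check that the four chosen atoms are not coplanar. The function name `buildup_order` and the input layout (a list of index pairs for S and a list of four starting atoms) are assumptions made for illustration.

```python
def buildup_order(n, pairs, basis):
    """Order in which atoms can be determined, plus the atoms left over."""
    neighbors = {i: set() for i in range(n)}
    for i, j in pairs:
        neighbors[i].add(j)
        neighbors[j].add(i)
    determined = set(basis)
    order = list(basis)
    progress = True
    while progress:
        progress = False
        for i in range(n):
            # An atom is determinable once it has distances to at least
            # four atoms whose coordinates are already known.
            if i not in determined and len(neighbors[i] & determined) >= 4:
                determined.add(i)
                order.append(i)
                progress = True
    return order, set(range(n)) - determined


pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),   # initial basis
         (5, 0), (5, 1), (5, 2), (5, 3),                   # atom 5
         (4, 1), (4, 2), (4, 3), (4, 5),                   # atom 4 needs atom 5
         (6, 0), (6, 1)]                                   # atom 6: too few
order, left = buildup_order(7, pairs, [0, 1, 2, 3])
```

In this example atom 4 cannot be placed in the first pass because one of its four known neighbors, atom 5, is placed only later; it is picked up in the second pass. Atom 6 has only two distances and stays undetermined.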

16 Control of Rounding Errors
However, there are two issues associated with sparse problems that can prevent appropriate solutions. The first one is that the basis atoms are themselves determined from other atoms, so the rounding errors in the calculations accumulate and eventually produce incorrect results. One way to resolve this issue is to use the distances among the basis atoms to recalculate their coordinates. The rounding errors are then stopped from propagating. The extra computation requires only a constant time for calculating the coordinates and for translating and rotating the structure to its original location.

17 Control of Rounding Errors

18 Tolerate Distance Errors
Another issue is that each atom may have more than four distances from determined atoms. If the distances are exact and consistent, the extra distance constraints can be satisfied automatically. However, if the distances have errors and are not consistent, and if we still want to obtain a best possible structure, we need to consider all those distances and satisfy them all as much as possible.

19 Tolerate Distance Errors
This can be done by using an optimization technique: the coordinates xi of an undetermined atom are found by solving a least-squares problem, taken over all (i,j) in S for which xj is already determined, so as to satisfy all distance constraints imposed by previously determined atoms on the undetermined one.

20 The objective function is convex and the problem can be solved using a standard Newton method.
Each function evaluation requires on the order of n floating-point operations, where n is the number of atoms, since the sum runs over all (i,j) in S for which xj is determined. In the ideal case when every atom can be determined, n atoms require O(n^2) floating-point operations.
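The exact objective function shown on the slide is not preserved in this transcript. As a stand-in, the sketch below uses a simpler linearized least-squares fit: it subtracts the first squared-distance equation from the others and solves the resulting overdetermined linear system through its normal equations. This is a swapped-in technique for illustration, not the Newton-based method described in the talk, and the names `locate_atom_lsq` and `det3` are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def locate_atom_lsq(neighbors, dists):
    """Linearized least-squares position from m >= 4 determined atoms."""
    if len(neighbors) < 4:
        raise ValueError("need at least four determined atoms")
    x0, d0 = neighbors[0], dists[0]
    n0 = sum(c * c for c in x0)
    rows, rhs = [], []
    for xk, dk in zip(neighbors[1:], dists[1:]):
        rows.append([2.0 * (xk[i] - x0[i]) for i in range(3)])
        rhs.append(sum(c * c for c in xk) - n0 - dk * dk + d0 * d0)
    # Normal equations (A^T A) x = A^T b for the (m - 1) x 3 system.
    N = [[sum(r[a] * r[b] for r in rows) for b in range(3)] for a in range(3)]
    g = [sum(r[a] * t for r, t in zip(rows, rhs)) for a in range(3)]
    d = det3(N)
    if abs(d) < 1e-12:
        raise ValueError("determined atoms are (nearly) coplanar")
    sol = []
    for col in range(3):
        M = [row[:] for row in N]
        for r in range(3):
            M[r][col] = g[r]
        sol.append(det3(M) / d)
    return tuple(sol)


known = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
         (0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (2.0, 0.0, 1.0)]
target = (0.4, 0.9, -0.3)
exact = [math.dist(target, xk) for xk in known]
fitted = locate_atom_lsq(known, exact)
```

With exact, consistent distances the fit reproduces the true position, in line with the remark that extra constraints are then satisfied automatically. With inconsistent distances it returns a compromise position. The work per atom grows with the number of determined neighbors, which is the source of the O(n^2) total in the ideal case.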

21 NMR Structure Determination
The distances are given with their possible ranges. In real NMR structure determination, the distance data is more complicated: only the ranges of the distances can be estimated. The problem then becomes finding the coordinates for the atoms so that the distances between certain pairs of atoms are within their given ranges, or in other words, between their lower and upper bounds. For this reason, in the build-up procedure, we also need to determine the coordinates of each atom so that the distances from this atom to previously determined atoms are all within their given bounds.

22 This again can be done by solving a least-squares problem over (i, j) in S, similar to the one used in the case with inexact distances.
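The formula on this slide is not preserved in the transcript. One common way to express such a problem, assumed here purely for illustration, is to penalize only the part of each distance that falls outside its range. The function name `bound_violation` and the squared-hinge form of the penalty are assumptions, not the authors' stated objective.

```python
import math


def bound_violation(x, neighbors, lower, upper):
    """Sum of squared violations of the distance ranges [lower, upper]."""
    total = 0.0
    for xj, lo, up in zip(neighbors, lower, upper):
        d = math.dist(x, xj)
        # Zero when lo <= d <= up; grows quadratically outside the range.
        total += max(lo - d, 0.0) ** 2 + max(d - up, 0.0) ** 2
    return total


neighbors = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
inside = bound_violation((0.0, 0.0, 0.0), neighbors, [2.0, 3.5], [4.0, 4.5])
outside = bound_violation((0.0, 0.0, 0.0), neighbors, [2.0, 5.0], [4.0, 6.0])
```

A position whose distances all lie within their bounds gives a value of zero, so any such position is a minimizer. In the second call the distance 4 to the second atom is one unit below its lower bound of 5, which contributes a penalty of 1.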

23 Computational Results
The structure of 4MBA (red lines) determined by using a geometric build-up algorithm with a subset of all pairs of inter-atomic distances. The X-ray crystallography structure is shown in blue lines.

24 Computational Results
The total distance errors (red) for the partial structures of a polypeptide chain obtained by using a geometric build-up are all smaller than 1 Å, while those (blue) by using CNS (Brünger et al) grow quickly with increasing numbers of atoms in the chain.

25 Extension to Statistical Distance Data
The distributions of the distances in a structure database; structure prediction. Finally, I want to mention that the build-up algorithm can also be used in the same fashion for problems with probability distributions of the distances. Assuming that the distance distributions are obtained for the atoms in the given protein, a structure can then be determined by maximizing the joint probability of the distances. This can be done step by step in a build-up process: in every step, the position of the atom is determined so as to maximize the joint probability of its distances to the already determined atoms. Work along this direction may have great applications in structure prediction. I will not go into too much detail due to the time limit for the talk.

