Approximation of Protein Structure for Fast Similarity Measures
Fabian Schwarzer, Itay Lotan
Stanford University
Comparing Protein Structures
Same protein: analysis of MDS and MCS trajectories (http://folding.stanford.edu)
Structure prediction applications: evaluating decoy sets, clustering predictions (Shortle et al., Biophysics '98)
Graph-based methods: Stochastic Roadmap Simulation (Apaydin et al., RECOMB '02)
k Nearest-Neighbors Problem
Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c.
Brute force takes O(N · L) time, where N is the size of S and L is the time needed to compare two conformations.
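A minimal sketch of this brute-force baseline (the function name knn_brute_force and the generic dist callable are illustrative choices, not from the slides):

```python
import numpy as np

def knn_brute_force(query, confs, k, dist):
    """Indices of the k conformations in `confs` most similar to `query`.

    One call to `dist` (e.g. cRMS or dRMS, cost L) per conformation gives
    the O(N * L) total cost discussed above, plus a final sort.
    """
    d = np.array([dist(query, c) for c in confs])
    return np.argsort(d)[:k]
```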
k Nearest-Neighbors Problem
What if the k nearest neighbors are needed for all c in S? O(N² · L) time is too much.
Can be improved by:
1. Reducing L
2. A more efficient nearest-neighbor algorithm
Our Solution
Reduce the structure description, giving approximate but fast similarity measures.
Reduce the description further, so that efficient nearest-neighbor algorithms can be used.
Description of a Protein's Structure
The 3n coordinates of the Cα atoms (n – number of residues).
Similarity Measures - cRMS
The RMS of the distances between corresponding atoms after the two conformations are optimally aligned.
Computed in O(n) time.
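As a concrete reference, here is a minimal NumPy sketch of cRMS via the standard Kabsch superposition; the function name crms and the (n, 3) coordinate-array convention are our own choices, not from the slides:

```python
import numpy as np

def crms(P, Q):
    """cRMS between two conformations P and Q, each an (n, 3) array of
    C-alpha coordinates: RMS atom distance after optimal superposition."""
    P = P - P.mean(axis=0)                   # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Q.T @ P)        # 3x3 covariance -> optimal rotation
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P - Q @ R.T                       # Q rotated onto P
    return np.sqrt((diff ** 2).sum() / len(P))
```

Only the n-term sums depend on n; the SVD is of a fixed 3x3 matrix, which is why the measure is O(n).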
Similarity Measures - dRMS
The Euclidean distance between the intra-molecular distance matrices of the two conformations.
Computed in O(n²) time.
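A matching sketch for dRMS (again, drms and the (n, 3) convention are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def drms(P, Q):
    """dRMS between two conformations P and Q of shape (n, 3): the Euclidean
    distance between their intra-molecular distance matrices. Only the
    n(n-1)/2 distinct pairwise distances are needed, hence O(n^2) time."""
    dP = pdist(P)                   # condensed pairwise C-alpha distances of P
    dQ = pdist(Q)
    return np.linalg.norm(dP - dQ)  # some definitions also divide by sqrt(len(dP))
```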
m-Averaged Approximation
Cut the chain into m pieces.
Replace each sequence of n/m Cα atoms by its centroid.
This reduces the description from 3n coordinates to 3m coordinates.
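A sketch of the m-averaging step (m_average is an illustrative name; np.array_split simply cuts the chain into m nearly equal consecutive pieces):

```python
import numpy as np

def m_average(coords, m):
    """Cut an (n, 3) array of C-alpha coordinates into m consecutive pieces
    and replace each piece by its centroid, giving an (m, 3) array."""
    pieces = np.array_split(coords, m)
    return np.vstack([piece.mean(axis=0) for piece in pieces])
```

The approximate measures can then be computed directly on the reduced chains, e.g. crms(m_average(P, m), m_average(Q, m)), and likewise for dRMS.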
Why m-Averaging?
Averaging reduces the description of random chains with small error, as demonstrated through Haar wavelet analysis.
Protein backbones behave on average like random chains:
Chain topology
Limited compactness
Evaluation: Test Sets
1. Decoy sets: conformations from the Park-Levitt set (Park & Levitt, JMB '96), N = 10,000
2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins '00), N = 5,000
9 structurally diverse proteins of 38–76 residues.
Decoy Sets Correlation

m     cRMS          dRMS
4     0.37 – 0.73   0.40 – 0.86
8     0.84 – 0.98   0.70 – 0.94
12    0.98 – 0.99   0.92 – 0.96
16                  0.92 – 0.98
20                  0.93 – 0.97

Higher correlation for the random sets!
Speed-up for Decoy Sets
Between 5X and 8X for cRMS (m = 8)
Between 9X and 36X for dRMS (m = 12), with very small error
For the random sets, the speed-up for dRMS was between 25X and 64X (m = 8)
Efficient Nearest-Neighbor Algorithms
Efficient nearest-neighbor algorithms exist, but they are not compatible with these similarity measures:
cRMS is not a Euclidean metric
dRMS uses a space of dimensionality n(n-1)/2
Further Dimensionality Reduction of dRMS
kd-trees require a dimension of roughly 20 or less.
m-averaging alone is not enough for dRMS (the dimension is still m(m-1)/2).
Reduce further using SVD.
SVD: a tool for principal component analysis; it computes the directions of greatest variance.
Reduction Using SVD
1. Stack the m-averaged distance matrices as vectors
2. Compute the SVD of the entire set
3. Project onto the most important singular vectors
dRMS is thus reduced to 20 dimensions.
Without m-averaging, the SVD can be too costly.
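A minimal sketch of this pipeline, assuming the m_average() helper from the earlier sketch and mean-centering of the stacked vectors before the SVD (the slides do not specify centering); project_set and its return values are illustrative names:

```python
import numpy as np
from scipy.spatial.distance import pdist

def project_set(conformations, m, n_dims=20):
    """Stack the m-averaged distance matrices of a set of conformations as
    row vectors, compute the SVD of the whole set, and project every
    conformation onto the n_dims most important (right) singular vectors."""
    X = np.vstack([pdist(m_average(c, m)) for c in conformations])  # N x m(m-1)/2
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_dims].T                   # top principal directions
    return (X - mean) @ basis, mean, basis  # N x n_dims projections
```

New queries can be projected the same way: (pdist(m_average(q, m)) - mean) @ basis.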
Testing the Method
Use the decoy sets (N = 10,000).
m-averaging with m = 16.
Project onto the 20 largest PCs (more than 95% of the variance).
Each conformation is represented by 20 numbers.
Results
For k = 10, 25, 100:
Decoy sets: ~80% of the nearest neighbors are found correctly; the furthest reported NN is off by 10% - 20% (0.7Å – 1.5Å).
For 1CTF with N = 100,000, the results were similar.
Random sets: 90% correct, with a smaller error (5% - 10%).
When precision is important, use the method as a pre-filter with a larger k than needed.
Running Time
N = 100,000, k = 100 for each conformation:
Brute force: ~84 hours
Brute force + m-averaging: ~4.8 hours
Brute force + m-averaging + SVD: 41 minutes
kd-tree + m-averaging + SVD: 19 minutes
kd-trees will have more impact for larger sets.
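For the kd-tree variant, an off-the-shelf structure such as scipy's cKDTree fits the 20-dimensional projected vectors; the array X_reduced below is a random stand-in for the projected set from project_set above:

```python
import numpy as np
from scipy.spatial import cKDTree

X_reduced = np.random.randn(100_000, 20)   # stand-in for the projected conformations

tree = cKDTree(X_reduced)                  # built once for the whole set
dists, idx = tree.query(X_reduced, k=100)  # 100 nearest neighbors of every conformation
# idx[i] lists the conformations closest to conformation i in the reduced
# (Euclidean) space; idx[i, 0] is i itself, at distance 0.
```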
Structural Classification
Computing the similarity between structures of two different proteins (e.g., 1IRD vs. 2MM1) is more involved.
The correspondence problem: which parts of the two structures should be compared?
STRUCTAL (Gerstein & Levitt '98)
1. Compute an optimal correspondence using dynamic programming
2. Optimally align the corresponding parts in space to minimize cRMS
3. Repeat until convergence
The result depends on the initial correspondence!
O(n1 · n2) time for proteins of lengths n1 and n2.
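The sketch below follows the alternation described on this slide, but the residue-pair score 20 / (1 + d^2 / 5), the gap penalty, the identity initial correspondence, and all function names are illustrative assumptions rather than the actual STRUCTAL parameters:

```python
import numpy as np

def kabsch_transform(P, Q):
    """Rotation R and translation t that best superpose Q onto P (both (k, 3))."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((Q - cQ).T @ (P - cP))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cP - cQ @ R.T

def align_dp(P, Qa, gap=10.0):
    """Dynamic-programming correspondence between P and the aligned Qa."""
    n1, n2 = len(P), len(Qa)
    d2 = ((P[:, None, :] - Qa[None, :, :]) ** 2).sum(-1)
    S = 20.0 / (1.0 + d2 / 5.0)                      # assumed structural score
    F = np.zeros((n1 + 1, n2 + 1))
    F[1:, 0] = -gap * np.arange(1, n1 + 1)
    F[0, 1:] = -gap * np.arange(1, n2 + 1)
    back = np.zeros((n1 + 1, n2 + 1), dtype=int)     # 0 = match, 1 = up, 2 = left
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            choices = (F[i - 1, j - 1] + S[i - 1, j - 1],
                       F[i - 1, j] - gap, F[i, j - 1] - gap)
            back[i, j] = int(np.argmax(choices))
            F[i, j] = choices[back[i, j]]
    pairs, i, j = [], n1, n2                         # traceback
    while i > 0 and j > 0:
        if back[i, j] == 0:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif back[i, j] == 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def structal_like(P, Q, rounds=10):
    """Alternate correspondence and superposition until the match stabilizes."""
    k = min(len(P), len(Q))
    pairs = list(zip(range(k), range(k)))            # initial correspondence
    for _ in range(rounds):
        R, t = kabsch_transform(P[[i for i, _ in pairs]], Q[[j for _, j in pairs]])
        new_pairs = align_dp(P, Q @ R.T + t)
        if new_pairs == pairs:
            break
        pairs = new_pairs
    return pairs
```

Each round costs O(n1 · n2) for the score matrix and the DP table, matching the bound above.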
STRUCTAL + m-averaging
Compute the similarity for structures of the same SCOP super-family, with and without m-averaging:

n/m   correlation   speed-up
3     0.60 – 0.66   ~7
5     0.44 – 0.58   ~19
8     0.35 – 0.57   ~46

NN results were disappointing.
Conclusion
Fast computation of similarity measures.
Trade-off between speed and precision.
Exploits the chain topology and limited compactness of proteins.
Allows the use of efficient nearest-neighbor algorithms.
Can be used as a pre-filter when precision is important.
Random Chains
A chain of points c0, c1, c2, ..., c(n-1).
The dimensions are uncorrelated, and the average behavior of the chain can be approximated by normal variables.
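The formula that followed on the original slide is not reproduced in this transcript; a plausible reading of the random-chain model, with σ as our own notation for the step scale, is

```latex
c_{i+1} - c_i \;\sim\; \mathcal{N}\!\left(0,\ \sigma^2 I_3\right), \qquad i = 0, \dots, n-2,
```

with the three coordinate dimensions independent, matching the statement that the dimensions are uncorrelated.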
1-D Haar Wavelet Transform
Recursive averaging and differencing of the values:

Level   Averages              Detail coefficients
3       [ 9 7 2 6 5 1 4 6 ]
2       [ 8 4 3 5 ]           [ 1 -2 2 -1 ]
1       [ 6 4 ]               [ 2 -1 ]
0       [ 5 ]                 [ 1 ]

Transform: [ 9 7 2 6 5 1 4 6 ] becomes [ 5 1 2 -1 1 -2 2 -1 ].
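A short sketch of this transform, which reproduces the example above (haar_transform is an illustrative name; the input length is assumed to be a power of two):

```python
def haar_transform(values):
    """Unnormalized 1-D Haar transform by recursive averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details], e.g.
    haar_transform([9, 7, 2, 6, 5, 1, 4, 6]) == [5, 1, 2, -1, 1, -2, 2, -1].
    """
    details = []
    while len(values) > 1:
        pairs = list(zip(values[0::2], values[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details  # prepend this level
        values = [(a + b) / 2 for a, b in pairs]             # averages move up a level
    return values + details
```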
Haar Wavelets and Compression
Compress by discarding the smallest coefficients.
When detail coefficients are discarded, the approximation error is the square root of the sum of the squares of the discarded coefficients.
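In symbols, and assuming the orthonormal (normalized) Haar basis rather than the plain averaging/differencing coefficients shown above:

```latex
\lVert f - \tilde{f} \rVert_2 \;=\; \sqrt{\sum_{i \in D} d_i^{2}}
```

where D indexes the discarded detail coefficients d_i of the signal f and \tilde{f} is its compressed approximation.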
Transform of Random Chains
m-averaging (m = 2^v) corresponds to discarding the lowest (finest) levels of detail coefficients.
For random chains the pdf of the detail coefficients is known, so the coefficients are expected to be ordered by level!
Discard coefficients starting at the lowest level.
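A small consistency check of this equivalence, assuming the m_average() sketch from earlier and a chain length divisible by m: keeping only the block averages of the Haar hierarchy (i.e. discarding all finer detail coefficients) reproduces the m-averaged chain exactly.

```python
import numpy as np

coords = np.random.randn(16, 3).cumsum(axis=0)    # a small 16-point random chain
m = 4                                             # m = 2^v blocks (v = 2)

averaged = m_average(coords, m)                   # m-averaging from the earlier sketch
haar_avg = coords.reshape(m, -1, 3).mean(axis=1)  # level-v Haar averages only

assert np.allclose(averaged, haar_avg)
```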
Random Chains and Proteins