Approximation of Protein Structure for Fast Similarity Measures Fabian Schwarzer Itay Lotan Stanford University

Comparing Protein Structures
Same protein:
- Analysis of molecular dynamics (MD) and Monte Carlo (MC) simulation trajectories
- Structure prediction applications: evaluating decoy sets, clustering predictions (Shortle et al., Biophysics '98)
- Graph-based methods: Stochastic Roadmap Simulation (Apaydin et al., RECOMB '02)

k Nearest-Neighbors Problem
Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c.
Can be done by brute force in O(N·L) time, where N is the size of S and L is the time to compare two conformations.
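Not part of the original slides: a minimal brute-force sketch in Python, assuming a generic distance function dist (e.g., the cRMS or dRMS measures defined on the following slides); it performs exactly the N comparisons of cost L quoted above.

```python
import heapq

def k_nearest(S, c, k, dist):
    """Brute-force k-NN: compare the query conformation c against
    every conformation in S and keep the k with smallest distance.
    Cost: N evaluations of dist, each taking time L."""
    # nsmallest scans S once while maintaining a k-element heap.
    return heapq.nsmallest(k, S, key=lambda s: dist(s, c))
```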

k Nearest-Neighbors Problem
What if the k nearest neighbors are needed for all c in S? Brute force then takes too much time.
This can be improved by:
1. Reducing L
2. A more efficient algorithm

Our Solution
- Reduce the structure description: approximate but fast similarity measures
- Reduce the description further: efficient nearest-neighbor algorithms can be used

Description of a Protein's Structure
The 3n coordinates of the Cα atoms (n = number of residues).

Similarity Measures - cRMS
The RMS of the distances between corresponding atoms after the two conformations are optimally aligned.
Computed in O(n) time.
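The slide gives only the definition; here is a standard Kabsch-style sketch (an assumed implementation in Python/NumPy, not the authors' code) for two conformations passed as (n, 3) arrays of Cα coordinates. The SVD is of a fixed 3x3 matrix, which is why the whole computation is O(n).

```python
import numpy as np

def crms(A, B):
    """cRMS between two conformations, given as (n, 3) arrays of
    C-alpha coordinates, after optimal rigid-body alignment."""
    A = A - A.mean(axis=0)          # remove translation
    B = B - B.mean(axis=0)
    # Optimal rotation from the SVD of the 3x3 covariance matrix.
    U, _, Vt = np.linalg.svd(A.T @ B)
    d = np.sign(np.linalg.det(U @ Vt))
    U[:, -1] *= d                   # avoid an improper rotation (reflection)
    R = U @ Vt
    diff = A @ R - B                # residuals after superposition
    return np.sqrt((diff ** 2).sum() / len(A))
```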

Similarity Measures - dRMS
The Euclidean distance between the intra-molecular distance matrices of the two conformations.
Computed in O(n²) time.
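A corresponding sketch (same caveat: an assumed implementation, and normalization conventions for dRMS vary; this is the unnormalized form):

```python
import numpy as np
from scipy.spatial.distance import pdist

def drms(A, B):
    """dRMS: Euclidean distance between the intra-molecular
    distance matrices of two (n, 3) conformations. O(n^2) time."""
    # pdist returns the n(n-1)/2 pairwise distances within one conformation.
    return np.linalg.norm(pdist(A) - pdist(B))
```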

m-Averaged Approximation
- Cut the chain into m pieces
- Replace each sequence of n/m Cα atoms by its centroid
- 3n coordinates → 3m coordinates
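As a sketch (an assumed implementation; np.array_split handles the case where m does not divide n evenly):

```python
import numpy as np

def m_average(coords, m):
    """Replace each run of ~n/m consecutive C-alpha atoms by its
    centroid, reducing an (n, 3) conformation to an (m, 3) one."""
    pieces = np.array_split(coords, m)       # m consecutive pieces
    return np.array([p.mean(axis=0) for p in pieces])
```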

Why m-Averaging?
- Averaging reduces the description of random chains with small error (demonstrated through Haar wavelet analysis)
- Protein backbones behave on average like random chains: chain topology, limited compactness

Evaluation: Test Sets
1. Decoy sets: conformations from the Park-Levitt set (Park & Levitt, JMB '96), N = 10,000
2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins '00), N = …
Structurally diverse proteins of size … residues.

Decoy Sets Correlation
[Table: correlation between the m-averaged and exact measures for several values of m, for both cRMS and dRMS; surviving entry: 0.97.]
Higher correlation for random sets!

Speed-up for Decoy Sets
- Between 5X and 8X for cRMS (m = 8)
- Between 9X and 36X for dRMS (m = 12), with very small error
- For random sets the speed-up for dRMS was between 25X and 64X (m = 8)

Efficient Nearest-Neighbor Algorithms
There are efficient nearest-neighbor algorithms, but they are not directly compatible with these similarity measures:
- cRMS is not a Euclidean metric
- dRMS uses a space of dimensionality n(n-1)/2

Further Dimensionality Reduction of dRMS
- kd-trees require dimension ≤ 20
- m-averaging with dRMS is not enough; reduce further using the SVD
- SVD: a tool for principal component analysis; computes the directions of greatest variance

Reduction Using SVD
1. Stack the m-averaged distance matrices as vectors
2. Compute the SVD of the entire set
3. Project onto the most important singular vectors
dRMS is thus reduced to ≤ 20 dimensions. Without m-averaging, the SVD can be too costly.
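A sketch of the three steps (my assumptions: NumPy/SciPy, and centering the rows before the SVD, which the slides do not specify but is the usual PCA convention):

```python
import numpy as np
from scipy.spatial.distance import pdist

def project_set(confs, m, d=20):
    """Reduce a set of conformations to d-dimensional vectors:
    m-average each conformation, flatten its pairwise-distance
    matrix, then project the set onto its top d principal directions."""
    # One row per conformation: the m(m-1)/2 centroid-centroid distances.
    X = np.array([pdist(np.array([p.mean(axis=0)
                                  for p in np.array_split(c, m)]))
                  for c in confs])
    Xc = X - X.mean(axis=0)                  # center before the SVD (PCA)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Euclidean distances between these d-vectors approximate dRMS.
    return Xc @ Vt[:d].T
```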

Testing the Method
- Use decoy sets (N = 10,000)
- m-averaging with m = 16
- Project onto the 20 largest PCs (more than 95% of the variance)
- Each conformation is represented by 20 numbers

Results
For k = 10, 25, 100:
- Decoy sets: ~80% correct; the furthest NN is off by 10% - 20% (0.7Å - 1.5Å)
- 1CTF, with N = 100,000: similar results
- Random sets: ~90% correct, with smaller error (5% - 10%)
When precision is important, use as a pre-filter with a larger k than needed.

Running Time
N = 100,000; k = 100, for each conformation:
- Brute force: ~84 hours
- Brute force + m-averaging: ~4.8 hours
- Brute force + m-averaging + SVD: 41 minutes
- kd-tree + m-averaging + SVD: 19 minutes
kd-trees will have more impact for larger sets.
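The slides do not name a kd-tree implementation; one assumed way to realize the last configuration with SciPy (random stand-in data in place of the actual reduced conformation vectors):

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(100000, 20)      # stand-in: N reduced 20-D vectors
tree = cKDTree(X)                   # kd-trees work well at dimension ~20
# 100 nearest neighbors of every conformation in the set.
dists, idx = tree.query(X, k=100)
```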

Structural Classification
Computing the similarity between structures of two different proteins is more involved.
The correspondence problem: which parts of the two structures should be compared?
[Figure: 1IRD vs. 2MM1]

STRUCTAL (Gerstein & Levitt '98)
1. Compute the optimal correspondence using dynamic programming
2. Optimally align the corresponding parts in space to minimize cRMS
3. Repeat until convergence
The result depends on the initial correspondence! O(n₁·n₂) time.

STRUCTAL + m-averaging
Compute the similarity for structures of the same SCOP super-family, with and without m-averaging.
[Table: correlation and speed-up for several values of n/m; surviving entries: correlation 0.57, speed-ups ~7, ~19, ~46.]
NN results were disappointing.

Conclusion
- Fast computation of similarity measures: a trade-off between speed and precision
- Exploits the chain topology and limited compactness of proteins
- Allows the use of efficient nearest-neighbor algorithms
- Can be used as a pre-filter when precision is important

Random Chains
[Figure: a random chain with joints c₀, c₁, c₂, …, c_{n-1}]
The dimensions are uncorrelated. Average behavior can be approximated by normal variables.
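As a loose illustration only (an assumed model, not necessarily the authors'): a random chain with fixed bond length (the typical 3.8 Å Cα-Cα distance) and uniformly random step directions. Sums of its steps are approximately normal by the central limit theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chain(n, bond=3.8):
    """Random chain of n joints: fixed bond length, uniformly
    random step directions. Coordinate sums are approximately
    normal (central limit theorem), and the three dimensions
    are uncorrelated."""
    steps = rng.normal(size=(n - 1, 3))                        # random directions
    steps *= bond / np.linalg.norm(steps, axis=1, keepdims=True)
    return np.vstack([np.zeros(3), np.cumsum(steps, axis=0)])  # c0 at the origin
```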

1-D Haar Wavelet Transform
Recursive averaging and differencing of the values:

Level   Averages        Detail coefficients
2       [… … … …]
1       [6 4]           [… …]
0       [5]             [1]
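A runnable sketch of the transform. The example input [10, 2, 8, 0] is an assumption: most of the slide's numbers were lost, and this signal is one that reproduces the surviving entries (averages [6 4], then [5], with top-level detail [1]).

```python
def haar_1d(values):
    """Unnormalized 1-D Haar transform: recursively average and
    difference adjacent pairs. Returns the overall average and the
    detail coefficients, coarsest level first."""
    values = list(values)
    details = []
    while len(values) > 1:
        avgs = [(a + b) / 2 for a, b in zip(values[::2], values[1::2])]
        difs = [(a - b) / 2 for a, b in zip(values[::2], values[1::2])]
        details = [difs] + details   # prepend so coarser levels come first
        values = avgs
    return values[0], details

# Assumed example: [10, 2, 8, 0] -> averages [6, 4] -> [5], detail [1].
print(haar_1d([10, 2, 8, 0]))   # (5.0, [[1.0], [4.0, 4.0]])
```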

Haar Wavelets and Compression
- Compress by discarding the smallest coefficients
- When detail coefficients are discarded, the approximation error is the square root of the sum of the squares of the discarded coefficients
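In symbols (my notation; this is the standard identity for an orthonormal wavelet basis, so it assumes normalized Haar coefficients): if the set D of detail coefficients c_i is discarded, the L2 error of the approximation \tilde{f} of f is

\|f - \tilde{f}\|_2 = \sqrt{\sum_{i \in D} c_i^2}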

Transform of Random Chains
- m-averaging (m = 2^v) corresponds to discarding the lowest levels of detail coefficients
- For random chains, the detail coefficients are normally distributed, with smaller variance at the lower (finer) levels, so the coefficients are expected to be ordered!
- Discard coefficients starting at the lowest level

Random Chains and Proteins