DALI Method Distance mAtrix aLIgnment

Slides:



Advertisements
Similar presentations
Types of Algorithms.
Advertisements

PROTEOMICS 3D Structure Prediction. Contents Protein 3D structure. –Basics –PDB –Prediction approaches Protein classification.
K Means Clustering , Nearest Cluster and Gaussian Mixture
Chapter 7 – Classification and Regression Trees
Image Indexing and Retrieval using Moment Invariants Imran Ahmad School of Computer Science University of Windsor – Canada.
Hidden Markov models for detecting remote protein homologies Kevin Karplus, Christian Barrett, Richard Hughey Georgia Hadjicharalambous.
Structural bioinformatics
Clustering II.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.
Bayesian Classification of Protein Data Thomas Huber Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics.
Agenda A brief introduction The MASS algorithm The pairwise case Extension to the multiple case Experimental results.
Author: Jason Weston et., al PANS Presented by Tie Wang Protein Ranking: From Local to global structure in protein similarity network.
4. Ad-hoc I: Hierarchical clustering
Reduced Support Vector Machine
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Appendix: Automated Methods for Structure Comparison Basic problem: how are any two given structures to be automatically compared in a meaningful way?
The Protein Data Bank (PDB)
Protein Tertiary Structure Comparison Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia.
Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu.
Protein threading Structure is better conserved than sequence
Protein structure prediction May 30, 2002 Quiz#4 on June 4 Learning objectives-Understand difference between primary secondary and tertiary structure.
BMI 731 Protein Structures and Related Database Searches.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
Protein Tertiary Structure Prediction Structural Bioinformatics.
Protein Structures.
Protein Structure Prediction and Analysis
Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path Ilya N. Shindyalov, Philip E. Bourne.
IBGP/BMI 705 Lab 4: Protein structure and alignment TA: L. Cooper.
Chapter 12 Protein Structure Basics. 20 naturally occurring amino acids Free amino group (-NH2) Free carboxyl group (-COOH) Both groups linked to a central.
Structural alignment Protein structure Every protein is defined by a unique sequence (primary structure) that folds into a unique.
Chapter 9 Superposition and Dynamic Programming 1 Chapter 9 Superposition and dynamic programming Most methods for comparing structures use some sorts.
PDBe-fold (SSM) A web-based service for protein structure comparison and structure searches Gaurav Sahni, Ph.D.
A computational study of protein folding pathways Reducing the computational complexity of the folding process using the building block folding model.
1 Randomized Algorithms for Three Dimensional Protein Structures Comparison Yaw-Ling Lin Dept Computer Sci and Info Engineering, Providence University,
RNA Secondary Structure Prediction Spring Objectives  Can we predict the structure of an RNA?  Can we predict the structure of a protein?
Chapter 9 – Classification and Regression Trees
Particle Filters for Shape Correspondence Presenter: Jingting Zeng.
Protein Structure Comparison. Sequence versus Structure The protein sequence is a string of letters: there is an optimal solution (DP) to the problem.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Protein Classification II CISC889: Bioinformatics Gang Situ 04/11/2002 Parts of this lecture borrowed from lecture given by Dr. Altman.
Protein Strucure Comparison Chapter 6,7 Orengo. Helices α-helix4-turn helix, min. 4 residues helix3-turn helix, min. 3 residues π-helix5-turn helix,
Study of Protein Prediction Related Problems Ph.D. candidate Le-Yi WEI 1.
A data-mining approach for multiple structural alignment of proteins WY Siu, N Mamoulis, SM Yiu, HL Chan The University of Hong Kong Sep 9, 2009.
Slides for “Data Mining” by I. H. Witten and E. Frank.
1Ellen L. Walker Category Recognition Associating information extracted from images with categories (classes) of objects Requires prior knowledge about.
DDPIn Distance and Density Based Protein Indexing David Hoksza Charles University in Prague Department of Software Engineering Czech Republic.
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
341- INTRODUCTION TO BIOINFORMATICS Overview of the Course Material 1.
Pair-wise Structural Comparison using DALILite Software of DALI Rajalekshmy Usha.
MINRMS: an efficient algorithm for determining protein structure similarity using root-mean-squared-distance Andrew I. Jewett, Conrad C. Huang and Thomas.
Lecture 11 CS5661 Structural Bioinformatics – Structure Comparison Motivation Concepts Structure Comparison.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Example Apply hierarchical clustering with d min to below data where c=3. Nearest neighbor clustering d min d max will form elongated clusters!
EMBL-EBI Eugene Krissinel SSM - MSDfold. EMBL-EBI MSDfold (SSM)
An Efficient Index-based Protein Structure Database Searching Method 陳冠宇.
Mismatch String Kernals for SVM Protein Classification Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble Presented by Pradeep Anand.
A Binary Linear Programming Formulation of the Graph Edit Distance Presented by Shihao Ji Duke University Machine Learning Group July 17, 2006 Authors:
Protein Structure Prediction: Threading and Rosetta BMI/CS 576 Colin Dewey Fall 2008.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Introduction to Data Mining Clustering & Classification Reference: Tan et al: Introduction to data mining. Some slides are adopted from Tan et al.
Structural Bioinformatics Elodie Laine Master BIM-BMC Semester 3, Genomics of Microorganisms, UMR 7238, CNRS-UPMC e-documents:
EBI is an Outstation of the European Molecular Biology Laboratory. PDBe-fold (SSM) A web-based service for protein structure comparison and structure searches.
Chapter 14 Protein Structure Classification
Bayesian Refinement of Protein Functional Site Matching
Protein Structures.
Text Categorization Berlin Chen 2003 Reference:
DALI Method Distance mAtrix aLIgnment
Presentation transcript:

DALI Method Distance mAtrix aLIgnment Liisa Holm and Chris Sander, “Protein structure comparison by alignment of distance matrices”, Journal of Molecular Biology Vol. 233, 1993. Liisa Holm and Chris Sander, “Mapping the protein universe”, Science Vol. 273, 1996. Liisa Holm and Chris Sander, “Alignment of three-dimensional protein structures: network server for database searching”, Methods in Enzymology Vol. 266, 1996.

How DALI Works? Based on fact: similar 3D structures have similar intra-molecular distances. Background idea Represent each protein as a 2D matrix storing intra-molecular distance. Place one matrix on top of another and slide vertically and horizontally – until a common the sub-matrix with the best match is found. Actual implementation Break each matrix into small sub-matrices of fixed size. Pair-up similar sub-matrices (one from each protein). Assemble the sub-matrix pairs to get the overall alignment. Protein A Protein B

Structure Representation of DALI 3D shape is described with a distance matrix which stores all intra-molecular distances between the Cα atoms. Distance matrix is independent of coordinate frame. Contains enough information to re-construct the 3D coordinates. Protein A Distance matrix for 2drpA and 1bbo Distance matrix for Protein A 1 2 3 4 1 2 3 4 d12 d13 d14 d23 d24 d34

Intra-molecular distance for myoglobin

DALI Algorithm Decompose distance matrix into elementary contact patterns (sub-matrices of fixed size) Use hexapeptide-hexapeptide contact patterns. Compare contact patterns (pair-wise), and store the matching pairs in pair list. Assemble pairs in the correct order to yield the overall alignment.

Assembly of Alignments Non-trivial combinatory problem. Assembled in the manner (AB) – (A’B’), (BC) – (B’C’), . . . (i.e., having one overlapping segment with the previous alignment) Available Alignment Methods: Monte Carlo optimization Brach-and-bound Neighbor walk

Schematic View of DALI Algorithm 3D (Spatial) 2D (Distance Matrix) 1D (Sequence)

Monte Carlo Optimization Used in the earlier versions of DALI. Algorithm Compute a similarity score for the current alignment. Make a random trial change to the current alignment (adding a new pair or deleting an existing pair). Compute the change in the score (S). If S > 0, the move is always accepted. If S <= 0, the move may be accepted by the probability exp(β * S), where β is a parameter. Once a move is accepted, the change in the alignment becomes permanent. This procedure is iterated until there is no further change in the score, i.e., the system is converged.

Branch-and-bound method Used in the later versions of DALI. Based on Lathrop and Smith’s (1996) threading (sequence-structure alignment) algorithm. Solution space consists of all possible placements of residues in protein A relative to the segment of residues of protein B. The algorithm recursively split the solution space that yields the highest upper bound of the similarity score until there is a single alignment trace left.

LOCK Uses a hierarchical approach Larger secondary structures such as helixes and strands are represented using vectors and dealt with first Atoms are dealt with afterwards Assumes large secondary structures provide most stability and function to a protein, and are most likely to be preserved during evolution

LOCK (Contd.) Key algorithm steps: Represent secondary structures as vectors Obtain initial superposition by computing local alignment of the secondary structure vectors (using dynamic programming) Compute atomic superposition by performing a greedy search to try to minimize root mean square deviation (a RMS distance measure) between pairs of nearest atoms from the two proteins Identify “core” (well aligned) atoms and try to improve their superposition (possibly at the cost of degrading superposition of non-core atoms) Steps 2, 3, and 4 require iteration at each step

Alignment of SSEs Define an orientation-dependent score and an orientation-independent score between SSE vectors. For every pair of query vectors, find all pairs of vectors in database protein that align with a score above a threshold. Two of these vectors must be adjacent. Use orientation independent scores. For each set of four vectors from previous step, find the transformation minimizing rmsd. Apply this transformation to the query. Run dynamic programming using both orientation-dependent and orientation-independent scores to find the best local alignment. Compute and apply the transformation from the best local alignment. Superpose in order to minimize rmsd.

Atomic superposition Loop until rmsd does not change find matching pairs of Ca atoms use only those within 3 A find best alignment until rmsd does not change

Core identification Loop until rmsd does not change find the best core (symmetric nns) and align; remove the rest until rmsd does not change

VAST Begin with a set of nodes (a,x) where SSEs a and x are of the same type Add an edge between (a,x) and (b,y) if angle and distance between (a,b) is same as between (x,y) Find the maximal clique in this graph; this forms the initial SSE alignment Extend the initial alignment to Ca atoms using Gibbs sampling Report statistics on this match

Quality of a structure match Statistical theory similar to BLAST Compare the likelihood of a match as compared to a random match Less agreement regarding score matrix z-scores of CE, DALI, and VAST may not be compatible

Protein Structure Classification CATH SCOP FSSP Up-to-date view of the protein structure universe SCOP is updated every six months. Determining SCOP classifications of protein structures automatically as they are published in Protein Data Bank (PDB).

Problem definition ? ? SCOP Classification root new protein structure fold fold fold superfamily superfamily family ? family family family family ?

Two problems Class membership? Class label assignment? Does the query protein belong to a SCOP category? Or does it need a new category to be defined? Binary classification problem: member, non-member Class label assignment? What SCOP category is the query protein assigned to? Multi-class classification problem

Hierarchical classification Let p be a protein structure, proceed bottom-up from family level to fold level: Does p belong to a family? report family yes Does p belong to a superfamily? no report superfamily yes Does p belong to a fold? no report fold yes new fold no

Component classifiers Using a sequence/structure comparison tool as a classifier Perform a nearest neighbor query: if similarityScore(query, NN) < trained cutoff then not a member of any category else member of class(NN) Comparison tools we have used: Sequence: PSI-Blast, HMMER+SUPERFAMILY database Structure: CE, Dali, Vast

Performance of component classifiers Database: SCOP 1.59 Query: SCOP 1.61 – SCOP 1.59 Class membership HMM BLAST CE Dali Vast At least one family 94.5% 92.6% 89% 98.2% superfamily 78.6% 66.1% 72.2% 77.6% 78.4% 96% fold 73% 60.7% 78.5% 82% 85% 100%

Performance of component classifiers Database: SCOP 1.59 Query: SCOP 1.61 – SCOP 1.59 Class label assignment HMM BLAST CE Dali Vast At least one family 94.8% 92.3% 91% 88% 92% 97.9% superfamily 69% 12% 81% 80.4% 81.7% 93.9% fold 40.5% 0% 46% 54% 64.9%

Normalization of similarity scores Universal confidence levels instead of tool-specific scores Perform nearest neighbor queries Database: SCOP 1.59 Query: SCOP 1.61 – SCOP 1.59 Partition score space of tools into confidence levels e.g. CE z-score of 5.4  we are 80% confident that the query protein is a member of an existing fold.

Consensus Decision Each component classifier reports a confidence level for the query protein: c = [C1, C2, C3, C4, C5] What is the best way to combine these probabilistic decisions? A solution: decision trees. Decision trees: Attribute order? Branching factor?

Proposed decision tree structure < θ11 > θ21 else L1 C2 L2 < θ12 > θ22 else L1 Cn L2 < θ1n > θ2n L1 L2

Determination of Cis and θjis Automated Generate all possible trees of height 3 and Cis as sum rules of up to 3 components. Determine θjis using a greedy optimization that minimizes impurities of nodes level by level. Disadvantage: overfits the data Manual Determine Cis by examining individual component’s performances Determine θjis considering two levels of the tree simultaneously and considering only the values between score clusters to avoid overfitting.

decision tree: superfamily level Vast? < 45% > 93% else new superfamily HMM? existing superfamily < 40% > 75% else new superfamily CE+Dali? existing superfamily < 55% >= 55% existing superfamily new superfamily

Experimental evaluation The dataset: Training Evaluation Database v1.59 (20449) v1.61 (22724) Query v1.61 – v1.59 (2241) v1.63 – v1.59 (2825) new family 248 618 new superfamily 84 424 new fold 47 339

Training: class membership

Testing: class membership

Training: class label assignment

Testing: class label assignment