DDPIn: Distance and Density Based Protein Indexing
David Hoksza
Charles University in Prague, Department of Software Engineering, Czech Republic

CIBCB

Presentation Outline
  Biological background
  Similarity search in protein structure databases
  DDPIn
    feature vector extraction
    metrics
    querying
      one-step approach
      multi-step approach
  Experimental results
  Conclusion

Biological Background
  Proteins
    molecules translated from mRNA in ribosomes
      DNA → RNA → protein
    sequence of amino acids (20 AAs)
      each coded by a codon (triplet of nucleotides)
  Function of a protein is derived from its three-dimensional structure
    → similar proteins have similar functions
    similar proteins have a common ancestor
  Identifying protein structure → finding similar proteins → getting a clue to the function

Similarity Search in Protein Databases
  Similarity between a pair of proteins
    alignment + similarity score
      RMSD, TM-score, …
    visual inspection
    DALI, CE, SAP, VAST, …
  Classification
    SCOP (Structural Classification of Proteins)
    no need for an alignment
    indexing various features
      PSI, PSIST, ProGreSS, CTSS, …, DDPIn
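
As an illustration of the similarity scores listed on this slide, RMSD can be sketched as below. This is a minimal sketch, assuming the two structures are already aligned residue by residue and superposed (the superposition step is omitted), and that each structure is given as a list of (x, y, z) coordinates. The function name `rmsd` is illustrative only.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes the points are already paired (aligned) and superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))
```

Because a score of this kind needs an alignment first, it is costly to evaluate against a whole database, which is the motivation for the alignment-free indexing methods listed under Classification.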

DDPIn - Overview
  Distance and Density based Protein Indexing
  Classification method
  Indexing of protein features
    distances among Cα atoms used
    each AA represents a feature → protein p consists of |p| features
    various semantics used
      based on clustering Cα atoms into rings
  Metric indexing employed (M-tree)
    kNN querying
    outcomes of several searches are merged to obtain final results

DDPIn - Feature Extraction
  Features
    n-dimensional vectors of real numbers
    AA ≈ viewpoint → VPT (viewpoint tag)
  sDens
    density of AAs in rings with a predefined width
  sDensSSE
    enhanced with SSE information
  sRad
    widths of rings containing a predefined percentage of AAs
  sRadSSE
    enhanced with SSE information
  sDir
    number of AAs in a ring pointing from the viewpoint
    sDens enhanced with direction information
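
The sDens and sRad semantics can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes Cα atoms are given as (x, y, z) tuples, that "density" means the raw count of atoms per ring, and that rings are concentric shells around the viewpoint. The names `s_dens`, `s_rad`, `vpts`, `ring_width`, `fraction` and `n_rings` are hypothetical.

```python
import math

def s_dens(ca_coords, vp_index, ring_width, n_rings):
    """Count C-alpha atoms in each of n_rings concentric rings of fixed width."""
    vp = ca_coords[vp_index]
    counts = [0] * n_rings
    for i, c in enumerate(ca_coords):
        if i == vp_index:
            continue  # the viewpoint itself is not counted
        ring = int(math.dist(vp, c) // ring_width)
        if ring < n_rings:
            counts[ring] += 1
    return counts

def s_rad(ca_coords, vp_index, fraction, n_rings):
    """Widths of successive rings that each hold the given fraction of atoms."""
    vp = ca_coords[vp_index]
    dists = sorted(math.dist(vp, c) for i, c in enumerate(ca_coords) if i != vp_index)
    per_ring = max(1, round(fraction * len(dists)))
    widths, inner = [], 0.0
    for r in range(n_rings):
        end = (r + 1) * per_ring
        if end > len(dists):
            break  # not enough atoms left to fill another ring
        outer = dists[end - 1]  # radius reaching the last atom of this ring
        widths.append(outer - inner)
        inner = outer
    return widths

def vpts(ca_coords, ring_width, n_rings):
    """One viewpoint tag per amino acid, so a protein p yields |p| vectors."""
    return [s_dens(ca_coords, i, ring_width, n_rings) for i in range(len(ca_coords))]
```

For example, with atoms at distances 1, 2, 3 and 5 from the viewpoint, `s_dens` with `ring_width=2` and `n_rings=3` gives `[1, 2, 1]`, while `s_rad` with `fraction=0.5` and `n_rings=2` gives `[2.0, 3.0]`.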

DDPIn - Similarity of VPTs
  Metrics
    L2
    weighted L2
      close neighborhood of VPs is more important
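
The two metrics can be sketched as below. The slide does not give the weights, so the default used here (weight 1/(i+1) for ring i, so that inner rings count more) is purely an assumption chosen to reflect "close neighborhood is more important". With strictly positive weights the weighted form is still a metric, which metric indexing requires.

```python
import math

def l2(u, v):
    """Plain Euclidean distance between two VPT vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weighted_l2(u, v, weights=None):
    """Weighted Euclidean distance; by default earlier (inner) rings weigh more.

    The default weights 1/(i+1) are an illustrative assumption.
    """
    if weights is None:
        weights = [1.0 / (i + 1) for i in range(len(u))]
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))
```

Under these default weights, a difference of 2 in the innermost ring gives distance 2.0, whereas the same difference in the second ring gives a smaller distance.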

DDPIn - Indexing Structure
  M-tree (Metric tree)
    dynamic, hierarchical indexing structure
    data space divided into ball-shaped data regions (hyper-spheres)
      root node represents a data region covering all data
      child nodes represent regions covering parts of the space, …
    data regions form a balanced hierarchical structure
      inner nodes → routing entries
      leaf nodes → ground entries
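
The way ball-shaped regions let a search skip parts of the data can be sketched as follows. This is not a full M-tree (insertion, node splitting and kNN search are omitted); it only illustrates the pruning rule that follows from the triangle inequality: a region with center c and radius r cannot contain any answer to a range query (q, r_q) when d(q, c) > r + r_q. The class `Ball` and function `range_search` are hypothetical names.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Ball:
    center: tuple
    radius: float
    children: list = field(default_factory=list)  # sub-regions (inner node)
    objects: list = field(default_factory=list)   # data points (leaf node)

def range_search(node, query, query_radius, dist=math.dist):
    """Return all stored objects within query_radius of query."""
    # Prune: the query ball and the node's ball do not intersect.
    if dist(query, node.center) > node.radius + query_radius:
        return []
    hits = [o for o in node.objects if dist(query, o) <= query_radius]
    for child in node.children:
        hits.extend(range_search(child, query, query_radius, dist))
    return hits
```

A small usage example, with two leaf regions under one root:

```python
left = Ball((0, 0), 1.5, objects=[(0, 0), (1, 0)])
right = Ball((10, 0), 1.5, objects=[(10, 0), (11, 0)])
root = Ball((5, 0), 7.0, children=[left, right])
print(range_search(root, (0.5, 0), 1.0))  # the right region is pruned without visiting its objects
```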

Querying / Classification
  One-step
    extracting VPTs from query → n queries
    ranking scheme
  Two-step
    healing
    reclassification with Smith-Waterman algorithm on sequences
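
The one-step approach, in which the n kNN results of a query protein are merged into a single class, can be sketched as below. The slide does not define the ranking scheme, so the reciprocal-rank vote used here is an assumption for illustration only, and `classify` is a hypothetical name.

```python
from collections import defaultdict

def classify(knn_results):
    """Merge per-VPT kNN results into one predicted class label.

    knn_results holds one list per query VPT; each list contains the class
    labels of the neighbours found, ordered by increasing distance.
    Each neighbour votes with weight 1/rank (an assumed scheme).
    """
    scores = defaultdict(float)
    for neighbours in knn_results:
        for rank, label in enumerate(neighbours, start=1):
            scores[label] += 1.0 / rank
    return max(scores, key=scores.get) if scores else None
```

For instance, `classify([["a", "b"], ["b", "a"], ["a", "c"]])` scores "a" at 2.5, "b" at 1.5 and "c" at 0.5, so the query is assigned to "a".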

Experimental Results
  SCOP 1.65 dataset
    class → fold → superfamily → family
    1810 proteins
    181 superfamilies
      at least 10 proteins each
    all α, all β, α + β and α/β classes
    query set reduced queries full
    used also by PSI, ProGreSS, PSIST methods
  Testing of
    superfamily classification accuracy
    fold classification accuracy

Finding Optimal k for kNN Queries

Accuracy of VPT Semantics

Accuracy for Increasing Dimension

Accuracy of Various Metrics

Suitability of Pairs of VPT Semantics for Healing
  identical correct classification
  identical wrong classification

Comparison of Classification Methods

Conclusion
  We have proposed
    a new representation of protein structures
      distance and density of Cα atoms
    a ranking scheme
    two-step classification
  We implemented
    M-tree indexing for the proposed representation
    classification against SCOP
  Experimental results
    best results among methods using identical classification
      98.9% superfamily classification accuracy
      100% fold classification accuracy
    comparable run time