DDPIn Distance and Density Based Protein Indexing David Hoksza Charles University in Prague Department of Software Engineering Czech Republic.


1 DDPIn: Distance and Density Based Protein Indexing
David Hoksza
Charles University in Prague, Department of Software Engineering, Czech Republic

2 CIBCB 2009 Presentation Outline
- Biological background
- Similarity search in protein structure databases
- DDPIn
  - feature vector extraction
  - metrics
  - querying
    - one-step approach
    - multi-step approach
- Experimental results
- Conclusion

3 Biological Background
- Proteins
  - molecules
  - translated from mRNA in ribosomes (DNA → RNA → protein)
  - sequence of amino acids (20 AAs)
  - each amino acid coded by a codon (triplet of nucleotides)
- Function of a protein is derived from its three-dimensional structure
  - → similar proteins have similar functions
  - similar proteins have a common ancestor
- Identifying protein structure → finding similar proteins → getting a clue to the function

4 Similarity Search in Protein Databases
- Similarity between a pair of proteins
  - alignment + similarity score (RMSD, TM-score, …)
  - visual inspection
  - DALI, CE, SAP, VAST, …
- Classification
  - SCOP (Structural Classification of Proteins)
  - no need for an alignment
  - indexing various features
  - PSI, PSIST, ProGreSS, CTSS, …, DDPIn
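RMSD, named on this slide as an alignment-based similarity score, can be illustrated with a minimal sketch. It assumes the two structures are already aligned residue by residue and superposed; finding that superposition is the job of the alignment tools and is not shown. The function name `rmsd` and the toy coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of 3D points.

    Assumes the points are already paired (aligned) and superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy example: two residues, the second one displaced by 2 units along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
score = rmsd(structure_a, structure_b)
```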

5 DDPIn - Overview
- Distance and Density based Protein Indexing
- Classification method
- Indexing of protein features
  - distances among Cα atoms used
  - each AA represents a feature → protein p consists of |p| features
    - various semantics used
  - based on clustering Cα atoms into rings
  - metric indexing employed (M-tree)
- kNN querying
  - outcomes of several searches are merged to obtain final results

6 DDPIn - Feature Extraction
- Features
  - n-dimensional vectors of real numbers
  - AA ≈ viewpoint → VPT (viewpoint tag)
- sDens
  - density of AAs in rings with a predefined width
  - sDensSSE: enhanced with SSE information
- sRad
  - widths of rings containing a predefined percentage of AAs
  - sRadSSE: enhanced with SSE information
- sDir
  - number of AAs in a ring pointing from the viewpoint
  - sDens enhanced with direction information
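The sDens and sRad semantics can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the slide does not say whether densities are raw counts or normalized, so raw counts are used here, and the function names, parameters, and toy coordinates are all hypothetical.

```python
import math

def sorted_distances(coords, viewpoint):
    """Sorted distances from the viewpoint C-alpha atom to all other C-alpha atoms."""
    origin = coords[viewpoint]
    return sorted(math.dist(origin, c) for j, c in enumerate(coords) if j != viewpoint)

def s_dens(coords, viewpoint, ring_width, n_rings):
    """sDens-like tag: number of residues in each concentric ring of fixed width."""
    tag = [0] * n_rings
    for d in sorted_distances(coords, viewpoint):
        ring = int(d // ring_width)
        if ring < n_rings:          # residues beyond the last ring are ignored
            tag[ring] += 1
    return tag

def s_rad(coords, viewpoint, n_rings):
    """sRad-like tag: width of each ring when every ring holds 1/n_rings of the residues."""
    dists = sorted_distances(coords, viewpoint)
    widths, inner = [], 0.0
    for r in range(1, n_rings + 1):
        k = math.ceil(r * len(dists) / n_rings)   # residues enclosed so far
        outer = dists[k - 1]
        widths.append(outer - inner)
        inner = outer
    return widths

# Toy chain of five C-alpha atoms placed on the x axis.
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
         (5.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
dens_tag = s_dens(chain, viewpoint=0, ring_width=3.0, n_rings=3)
rad_tag = s_rad(chain, viewpoint=0, n_rings=2)
```

Computing one such tag per residue gives the |p| feature vectors of a protein p; the tag length plays the role of the dimension n.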

7 DDPIn - Similarity of VPTs
- Metrics
  - L2
  - weighted L2
    - close neighborhood of VPs is more important
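A weighted L2 distance between two viewpoint tags might look like the sketch below. The slide only states that the close neighborhood of a viewpoint matters more; the concrete weights `1 / (i + 1)` and the function names are assumptions made for illustration.

```python
import math

def weighted_l2(u, v, weights):
    """Weighted Euclidean distance between two equally long vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))

def decaying_weights(n):
    """Assumed weighting: ring i gets weight 1 / (i + 1), so inner rings dominate."""
    return [1.0 / (i + 1) for i in range(n)]

w = decaying_weights(3)
d_example = weighted_l2([2, 1, 0], [0, 1, 0], w)   # differs only in the innermost ring
d_inner = weighted_l2([1, 0, 0], [0, 0, 0], w)
d_outer = weighted_l2([0, 0, 1], [0, 0, 0], w)
```

With strictly positive weights the function remains a metric, which is what a metric index such as the M-tree requires.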

8 DDPIn – Indexing Structure
- M-tree (Metric tree)
- Dynamic, hierarchical indexing structure
- Data space divided into ball-shaped data regions (hyper-spheres)
  - root node represents the data region covering all data
    - child nodes represent regions covering parts of the space, …
  - data regions form a balanced hierarchical structure
    - inner nodes → routing entries
    - leaf nodes → ground entries
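How ball-shaped regions let a search skip whole subtrees can be shown with a simplified sketch. It is not a full M-tree (no insertion, node splitting, or stored parent distances); the `Node` class and the hand-built toy tree are hypothetical and only demonstrate the triangle-inequality pruning test.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    pivot: tuple                                   # centre of the ball region
    radius: float                                  # covering radius of the region
    children: list = field(default_factory=list)   # inner node: routing entries
    objects: list = field(default_factory=list)    # leaf node: ground entries

def range_search(node, query, query_radius, dist=math.dist):
    """Return all stored objects within query_radius of query."""
    # If the query ball and the region ball do not intersect,
    # the triangle inequality guarantees nothing below can qualify.
    if dist(query, node.pivot) > node.radius + query_radius:
        return []
    hits = [o for o in node.objects if dist(query, o) <= query_radius]
    for child in node.children:
        hits.extend(range_search(child, query, query_radius, dist))
    return hits

leaf_a = Node(pivot=(0.0, 0.0), radius=1.0, objects=[(0.0, 0.0), (1.0, 0.0)])
leaf_b = Node(pivot=(10.0, 0.0), radius=1.0, objects=[(10.0, 0.0), (11.0, 0.0)])
root = Node(pivot=(5.0, 0.0), radius=6.0, children=[leaf_a, leaf_b])

found = range_search(root, (0.5, 0.0), 1.0)   # leaf_b is pruned without visiting its objects
```

A kNN query can reuse the same test by shrinking the query radius to the distance of the current k-th best candidate.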

9 Querying / Classification
- One-step
  - extracting VPTs from query → n queries
  - ranking scheme
- Two-step
  - healing
  - reclassification with Smith-Waterman algorithm on sequences
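The one-step approach (one kNN query per viewpoint tag, then merging by a ranking scheme) can be sketched as below. The slide does not define the ranking scheme, so the `1 / (rank + 1)` vote weighting is purely an assumption; the brute-force `knn` stands in for the M-tree search, and all names and data are hypothetical.

```python
import math
from collections import Counter

def knn(query_tag, database, k, dist=math.dist):
    """Brute-force kNN over a list of (tag, label) pairs; a stand-in for an index."""
    return sorted(database, key=lambda entry: dist(query_tag, entry[0]))[:k]

def classify(query_tags, database, k):
    """Run one kNN query per viewpoint tag and merge the answers by weighted voting."""
    votes = Counter()
    for tag in query_tags:
        for rank, (_, label) in enumerate(knn(tag, database, k)):
            votes[label] += 1.0 / (rank + 1)   # assumed weighting: closer hits count more
    return votes.most_common(1)[0][0]

# Toy database of 2-dimensional tags labelled with made-up superfamily names.
database = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]
query_tags = [(0, 0.2), (0.1, 0.9), (4, 4)]
predicted = classify(query_tags, database, k=2)
```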

10 Experimental Results
- SCOP 1.65 dataset
  - class → fold → superfamily → family
  - 1810 proteins
    - 181 superfamilies
  - at least 10 proteins each
  - all α, all β, α + β and α/β classes
- Query set
  - reduced: 181 queries
  - full
  - used also by PSI, ProGreSS, PSIST methods
- Testing of
  - superfamily classification accuracy
  - fold classification accuracy

11 Finding Optimal k for kNN Queries

12 Accuracy of VPT Semantics

13 Accuracy for Increasing Dimension

14 Accuracy of Various Metrics

15 Suitability of Pairs of VPT Semantics for Healing
(figure legend: identical correct classification, identical wrong classification)

16 Comparison of Classification Methods

17 Conclusion
- We have proposed
  - new representation of protein structures
    - distance and density of Cα atoms
  - ranking scheme
  - two-step classification
- We implemented
  - M-tree indexing for proposed representation
  - classification against SCOP
- Experimental results
  - best results among methods using identical classification
    - 98.9% superfamily classification accuracy
    - 100% fold classification accuracy
  - comparable run time

