Presentation is loading. Please wait.

Presentation is loading. Please wait.

David Hoksza, Supervisor: Tomáš Skopal, KSI MFF UK Similarity Search in Protein Databases.

Similar presentations


Presentation on theme: "David Hoksza, Supervisor: Tomáš Skopal, KSI MFF UK Similarity Search in Protein Databases."— Presentation transcript:

1 David Hoksza, hoksza@ksi.mff.cuni.cz Supervisor: Tomáš Skopal, skopal@ksi.mff.cuni.cz KSI MFF UK Similarity Search in Protein Databases

2 Thesis Contributions Protein sequence similarity  metric indexing approach  sequential approach  speedup within existing similarity model Protein structure similarity  nonalignment-based similarity approach  alignment-based similarity approach  similarity modelling itself

3 Protein structure & Motivation Transportation, building, signalling, catabolism,... Molecule consisting of 20 types of amino acids (AA) Central dogma of molecular biology  DNA → RNA → protein Proteins’ 3D interactions secure biological function protein structure similarity → biological function similarity protein sequence similarity → protein structure similarity Source: http://en.wikipedia.org/wiki/File:Main_protein_ structure_levels_en.svg

4 Protein Sequence Similarity Edit distance  minimal number of editing operations (insert/update/delete) needed for converting one sequence to the other  alignment Weighted edit distance  scoring matrix (PAM, BLOSSUM, …) amino acids’ mutation probability  affine gap penalty system  global alignment dynamic programming - Needleman-Wunsch (NW) algorithm  local alignment dynamic programming - Smith-Waterman (SW) algorithm the locality leads to nonmetric measures

5 Metric Indexing Approach Using metric access methods (MAMs)  organize objects into regions for effective filtering  require metric distance function Distance function  E-value with SW standard in biological databases statistical relevance  dealing with reflexivity, symmetry, triangular inequality (Trigen) D. Hoksza, T. Skopal. Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007

6 Metric Indexing Approach D. Hoksza, T. Skopal. Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007 Swissprot database 3,000 DB sequences 100 query sequences

7 Sequence-Based Approach Speeding up distance computation itself Skipping common parts in the DP matrix  context dependency problem  prefixes database sort skipping common prefixes  prefix ratio (PR) = speedup  suffixes  database division D. Hoksza. Improved Alignment of Protein Sequences Based on Common Parts, ISBRA 2008, LNCS

8 Sequence-Based Approach D. Hoksza. Improved Alignment of Protein Sequences Based on Common Parts, ISBRA 2008, LNCS Uniprot database

9 Protein Structure Similarity Nonalignment-based approach  indexing  feature extraction  simple or ad-hoc similarity measure Alignment-based approach  sequential scan  feature extraction  alignment  superposition (3D transformation)  similarity measure RMSD, TM-score Quality criterion  classification accuracy (SCOP hierarchical classification)

10 Density-Based Feature Extraction D. Hoksza. DDPIn: Distance and Density-Based Protein Indexing, CIBCB 2009, IEEE Features  n-dimensional vectors of real numbers  AA ≈ viewpoint → VPT (viewpoint tag) sDens  density of AAs in rings of predefined width sRad  widths of rings containing predefined percentage of AAs

11 Nonalignment-Based Approach One-step search o database creation 1. AAs → feature vectors 2. indexing using weighted L 2 metric and MAM o querying 1. AAs → feature vectors 2. feature vector → query object 3. results’ merging 4. SCOP classification Two-step search  2 one-step searches  results’ comparison  rescoring using Smith-Waterman  SCOP classification D. Hoksza. DDPIn: Distance and Density-Based Protein Indexing, CIBCB 2009, IEEE

12 Alignment-Based Approach D. Hoksza, J. Galgonek. Density-Based Classification of Protein Structures Using Iterative TM-score, BIBMW 2009, IEEE D. Hoksza, J. Galgonek. Alignment-Based Extension to DDPIn Feature Extraction, IJCB, ACTA Press, 2010

13 Alignment-Based Approach Feature extraction  AAs → feature vectors  density-based feature extraction Alignment (amino acid matching)  Smith-Waterman alignment distance between feature vectors → scoring matrix modified variable gap penalty system Superposition + scoring  RMSD  TM-score reducing number of initial states iterative dynamic programming with belt-based restriction D. Hoksza, J. Galgonek. Density-Based Classification of Protein Structures Using Iterative TM-score, BIBMW 2009, IEEE D. Hoksza, J. Galgonek. Alignment-Based Extension to DDPIn Feature Extraction, IJCB, ACTA Press, 2010

14 Alignment-Based Approach - Indexing Indexability measures  intrinsic dimensionality objects’ distribution  ball overlap factor (BOF) ball regions’ separation  T-error nontriangular triplets Semimetrization    reranking Metrization  Trigen J. Galgonek, D. Hoksza. On the Effectiveness of Distances Measuring Protein Structure Similarity, SISAP 2009, IEEE

15 Conclusion Protein sequence search  indexing data domain exploration  speeding distance computation 20% speedup expected speedup growth Protein structure search  nonalignment-based best accuracy on all SCOP levels  alignment-based best accuracy on superfamily and fold SCOP levels indexing possibilities comparison with nonalignment-based

16 Publications D. Hoksza, T. Skopal Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007 D. Hoksza Improved Alignment of Protein Sequences Based on Common Parts ISBRA 2008, LNCS D. Hoksza DDPIn: Distance and Density-Based Protein Indexing CIBCB 2009, IEEE D. Hoksza, J. Galgonek Density-Based Classification of Protein Structures Using Iterative TM-score BIBMW 2009, IEEE J. Galgonek, D. Hoksza On the Effectiveness of Distances Measuring Protein Structure Similarity SISAP 2009, IEEE D. Hoksza, J. Galgonek Alignment-Based Extension to DDPIn Feature Extraction IJCB, ACTA Press, 2010


Download ppt "David Hoksza, Supervisor: Tomáš Skopal, KSI MFF UK Similarity Search in Protein Databases."

Similar presentations


Ads by Google