Download presentation
Presentation is loading. Please wait.
Published byMorgan Carroll Modified over 9 years ago
1
David Hoksza, hoksza@ksi.mff.cuni.cz Supervisor: Tomáš Skopal, skopal@ksi.mff.cuni.cz KSI MFF UK Similarity Search in Protein Databases
2
Thesis Contributions Protein sequence similarity metric indexing approach sequential approach speedup within existing similarity model Protein structure similarity nonalignment-based similarity approach alignment-based similarity approach similarity modelling itself
3
Protein structure & Motivation Transportation, building, signalling, catabolism,... Molecule consisting of 20 types of amino acids (AA) Central dogma of molecular biology DNA → RNA → protein Proteins’ 3D interactions secure biological function protein structure similarity → biological function similarity protein sequence similarity → protein structure similarity Source: http://en.wikipedia.org/wiki/File:Main_protein_ structure_levels_en.svg
4
Protein Sequence Similarity Edit distance minimal number of editing operations (insert/update/delete) needed for converting one sequence to the other alignment Weighted edit distance scoring matrix (PAM, BLOSSUM, …) amino acids’ mutation probability affine gap penalty system global alignment dynamic programming - Needleman-Wunsch (NW) algorithm local alignment dynamic programming - Smith-Waterman (SW) algorithm the locality leads to nonmetric measures
5
Metric Indexing Approach Using metric access methods (MAMs) organize objects into regions for effective filtering require metric distance function Distance function E-value with SW standard in biological databases statistical relevance dealing with reflexivity, symmetry, triangular inequality (Trigen) D. Hoksza, T. Skopal. Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007
6
Metric Indexing Approach D. Hoksza, T. Skopal. Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007 Swissprot database 3,000 DB sequences 100 query sequences
7
Sequence-Based Approach Speeding up distance computation itself Skipping common parts in the DP matrix context dependency problem prefixes database sort skipping common prefixes prefix ratio (PR) = speedup suffixes database division D. Hoksza. Improved Alignment of Protein Sequences Based on Common Parts, ISBRA 2008, LNCS
8
Sequence-Based Approach D. Hoksza. Improved Alignment of Protein Sequences Based on Common Parts, ISBRA 2008, LNCS Uniprot database
9
Protein Structure Similarity Nonalignment-based approach indexing feature extraction simple or ad-hoc similarity measure Alignment-based approach sequential scan feature extraction alignment superposition (3D transformation) similarity measure RMSD, TM-score Quality criterion classification accuracy (SCOP hierarchical classification)
10
Density-Based Feature Extraction D. Hoksza. DDPIn: Distance and Density-Based Protein Indexing, CIBCB 2009, IEEE Features n-dimensional vectors of real numbers AA ≈ viewpoint → VPT (viewpoint tag) sDens density of AAs in rings of predefined width sRad widths of rings containing predefined percentage of AAs
11
Nonalignment-Based Approach One-step search o database creation 1. AAs → feature vectors 2. indexing using weighted L 2 metric and MAM o querying 1. AAs → feature vectors 2. feature vector → query object 3. results’ merging 4. SCOP classification Two-step search 2 one-step searches results’ comparison rescoring using Smith-Waterman SCOP classification D. Hoksza. DDPIn: Distance and Density-Based Protein Indexing, CIBCB 2009, IEEE
12
Alignment-Based Approach D. Hoksza, J. Galgonek. Density-Based Classification of Protein Structures Using Iterative TM-score, BIBMW 2009, IEEE D. Hoksza, J. Galgonek. Alignment-Based Extension to DDPIn Feature Extraction, IJCB, ACTA Press, 2010
13
Alignment-Based Approach Feature extraction AAs → feature vectors density-based feature extraction Alignment (amino acid matching) Smith-Waterman alignment distance between feature vectors → scoring matrix modified variable gap penalty system Superposition + scoring RMSD TM-score reducing number of initial states iterative dynamic programming with belt-based restriction D. Hoksza, J. Galgonek. Density-Based Classification of Protein Structures Using Iterative TM-score, BIBMW 2009, IEEE D. Hoksza, J. Galgonek. Alignment-Based Extension to DDPIn Feature Extraction, IJCB, ACTA Press, 2010
14
Alignment-Based Approach - Indexing Indexability measures intrinsic dimensionality objects’ distribution ball overlap factor (BOF) ball regions’ separation T-error nontriangular triplets Semimetrization reranking Metrization Trigen J. Galgonek, D. Hoksza. On the Effectiveness of Distances Measuring Protein Structure Similarity, SISAP 2009, IEEE
15
Conclusion Protein sequence search indexing data domain exploration speeding distance computation 20% speedup expected speedup growth Protein structure search nonalignment-based best accuracy on all SCOP levels alignment-based best accuracy on superfamily and fold SCOP levels indexing possibilities comparison with nonalignment-based
16
Publications D. Hoksza, T. Skopal Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007 D. Hoksza Improved Alignment of Protein Sequences Based on Common Parts ISBRA 2008, LNCS D. Hoksza DDPIn: Distance and Density-Based Protein Indexing CIBCB 2009, IEEE D. Hoksza, J. Galgonek Density-Based Classification of Protein Structures Using Iterative TM-score BIBMW 2009, IEEE J. Galgonek, D. Hoksza On the Effectiveness of Distances Measuring Protein Structure Similarity SISAP 2009, IEEE D. Hoksza, J. Galgonek Alignment-Based Extension to DDPIn Feature Extraction IJCB, ACTA Press, 2010
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.