Published by Sherman Wilcox. Modified over 9 years ago.
1
An Efficient Index-based Protein Structure Database Searching Method 陳冠宇
2
Introduction. More than 18,000 protein structures were stored in the PDB as of September 2002. For 3D structural comparison and database searching, other methods practice exhaustive searching. The authors' design philosophy: filter-and-refine, using an index-based searching method. Result: 16 times faster than DALI.
3
Filter-and-Refine. [Diagram labels: query; database of 20,000 proteins; ProtDex; top 100 proteins; actual alignment; result.]
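The filter-and-refine idea on this slide can be sketched as a generic two-stage ranking. This is only an illustration: `filter_and_refine`, `cheap_score` and `expensive_score` are hypothetical names, and the scoring functions stand in for the index-based score and the actual alignment, neither of which is specified here.

```python
import heapq
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def filter_and_refine(
    query: T,
    database: Iterable[T],
    cheap_score: Callable[[T, T], float],      # hypothetical fast filter score
    expensive_score: Callable[[T, T], float],  # hypothetical slow refinement score
    top_k: int = 100,
) -> List[Tuple[float, T]]:
    """Rank the whole database cheaply, then re-rank only the top_k survivors."""
    # Filter: keep the top_k candidates under the cheap score.
    candidates = heapq.nlargest(
        top_k, database, key=lambda p: cheap_score(query, p)
    )
    # Refine: run the expensive comparison only on those candidates.
    refined = [(expensive_score(query, p), p) for p in candidates]
    refined.sort(key=lambda pair: pair[0], reverse=True)
    return refined
```

With 20,000 database entries and `top_k=100`, the expensive comparison runs 100 times instead of 20,000, which is where the speed-up of such a scheme would come from.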
4
Problem Definition. Protein structures; 3D structural comparison; structural database searching.
5
A protein is composed of a sequence of amino acid (AA) residues. SSE: secondary structure element (e.g. helices, sheets). Loop regions have no specific shape.
6
Sequence Comparison vs. Structural Comparison. Sequence comparison cannot determine the similarity of two remotely homologous proteins. Structural comparison instead superimposes one protein structure on another to obtain the minimum root mean square deviation (RMSD) between them, at a cost of $O(n^4 m^4)$.
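As a reminder of what is being minimized, here is a minimal sketch of the RMSD between two coordinate lists. It assumes the residues are already paired one-to-one and the structures are already superimposed; the expensive part, searching over pairings and superpositions, is not shown. The function name `rmsd` is illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root mean square deviation between two paired coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(a, b):
        # Squared Euclidean distance between the paired points.
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(a))
```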
7
The ProtDex Method. Step 1: Extracting information from the PDB database. Step 2: Building intra-molecular distance matrices (design rationale: two protein structures are similar if their distance matrices are similar). Step 3: Cutting fixed-size matrices and extracting properties. Step 4: Building the inverted file index.
8
Step 1: Extracting Information. For each protein chain in a PDB file: PDB id and chain id; number of AA residues; number of SSEs. For each AA residue: 3D coordinate (x, y, z) of the Cα carbon. For each SSE: SSE type (helix or sheet); SSE start position; SSE length.
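The fields listed on this slide could be held in a record like the one below. This is a hypothetical layout, not the original implementation: the class and field names are invented for illustration, the "H"/"E" type codes follow the labels used on later slides, and whether start positions are 0- or 1-based is an assumption left to the caller.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SSE:
    sse_type: str  # "H" for helix, "E" for sheet
    start: int     # start position of the SSE within the chain
    length: int    # number of residues in the SSE


@dataclass
class ChainRecord:
    pdb_id: str
    chain_id: str
    # One (x, y, z) coordinate per AA residue.
    coords: List[Tuple[float, float, float]] = field(default_factory=list)
    sses: List[SSE] = field(default_factory=list)

    @property
    def num_residues(self) -> int:
        return len(self.coords)

    @property
    def num_sses(self) -> int:
        return len(self.sses)
```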
9
Step 2: Representation: Building Distance Matrices. [Figure: example distance matrix for protein 9xxxx with 7 AA residues.]
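A minimal sketch of building an intra-molecular distance matrix, assuming one 3D coordinate per residue and plain Euclidean distance between them (the function name is illustrative):

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_matrix(coords: Sequence[Point]) -> List[List[float]]:
    """Return the symmetric n x n matrix of pairwise residue distances."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])  # Euclidean distance
            matrix[i][j] = d
            matrix[j][i] = d  # distances are symmetric
    return matrix
```

Because pairwise distances do not change when a structure is rotated or translated, two such matrices can be compared without first superimposing the structures.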
10
Step 3-1: Contact Patterns & Fixed-Size Matrices. [Figure labels: SSE(H) and SSE(E) contact patterns; fixed-size matrix.]
11
Step 3-2: Extracting Properties. For the 2×2 sub-matrix starting at cell (2,2), we store the values: 8, HH, (3,3), (1,1), (1,1). For the 2×2 sub-matrix starting at cell (3,6), we store the values: 49, HE, (3,2), (1,2), (2,1), etc.
12
Step 4: Building Inverted File Index. Implemented as a sorted list.
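One way an inverted file index can be kept as a sorted list is sketched below. It is an assumption-laden illustration: the index term is taken to be any sortable key derived from the Step 3 properties, the class name `SortedListIndex` is invented, and the chain identifiers in the usage example are made up.

```python
import bisect
from typing import Hashable, Iterable, List, Tuple

Posting = Tuple[Hashable, str]  # (index term, protein chain identifier)


class SortedListIndex:
    """Inverted index stored as one list of postings sorted by term."""

    def __init__(self, postings: Iterable[Posting]) -> None:
        self._postings = sorted(postings)
        # Parallel list of terms so bisect can search on the term alone.
        self._terms = [term for term, _ in self._postings]

    def lookup(self, term: Hashable) -> List[str]:
        """Return all chain identifiers whose postings carry this term."""
        lo = bisect.bisect_left(self._terms, term)
        hi = bisect.bisect_right(self._terms, term)
        return [chain for _, chain in self._postings[lo:hi]]


# Usage with made-up terms and chain identifiers:
index = SortedListIndex([
    (("HH", 8), "1abc:A"),
    (("HE", 49), "2xyz:B"),
    (("HH", 8), "2xyz:B"),
])
matches = index.lookup(("HH", 8))
```

A lookup costs two binary searches plus the size of the result, so a query touches only the postings for its own terms rather than every structure in the database.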
13
Searching a Protein Structure. The score of a database protein P against a query Q is

$$S(Q,P) = W_{FMCount}(Q,P) \times W_{GSum}(i,j) \times \sum_{match(i,j)} \left[ W_{Term}(i) \times \max_{match(a,b) \,\wedge\, PdbId_b = P} \left( W_{Area}(a,b) \times W_{ARatio}(a,b) \times W_{Ordinal}(a,b) \right) \right]$$

$W_{FMCount}$ compensates for large proteins being matched and scored more frequently than small ones. $W_{Term}$ adds weight to query index terms that rarely occur in the database.
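The slide does not give the formula for the term weight. As an illustration of the stated intent only (rare terms count more), the sketch below uses an inverse-document-frequency style weight borrowed from text retrieval; this is an assumed stand-in, not the weight used by ProtDex, and `term_weight` is a hypothetical name.

```python
import math


def term_weight(num_proteins_with_term: int, num_proteins: int) -> float:
    """IDF-style weight (an assumption): the rarer the term, the larger the weight."""
    if not 0 < num_proteins_with_term <= num_proteins:
        raise ValueError("term count must be between 1 and the database size")
    # log(1 + N / df) stays positive even when every protein has the term.
    return math.log(1.0 + num_proteins / num_proteins_with_term)
```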
15
Discussion. Design: representation of structures; scoring schemes; comparison algorithms; assessment of the results. Performance. Accuracy (SCOP): the classification hierarchy is made of 4 levels: class, fold, superfamily and family. Pros and cons of ProtDex.
18
Conclusions. Advantages: speed (no need to scan through each structure in the database). Disadvantages: cannot provide the actual alignment; storage overhead for the index structure (the entire index: 1.2 GB); time required to build and update the index (building the entire index: 30 min 38 sec).