An Efficient Index-based Protein Structure Database Searching Method 陳冠宇.

Introduction More than 18,000 protein structures were stored in the PDB as of September 2002. Structural (3D) comparison and database searching: other methods practice exhaustive searching. The authors' design philosophy: filter-and-refine, using an index-based searching method. Result: 16 times faster than DALI.

Filter-and-Refine [Figure: a query is run through ProtDex against a database of 20,000 proteins; the top 100 proteins are passed on to actual alignment, which produces the result.]

Problem Definition: protein structures; 3D structural comparison; structural database searching.

A protein is composed of a sequence of amino acid (AA) residues. SSE: secondary structure element (e.g. helices, sheets). Loop regions have no specific shape.

Sequence Comparison vs. Structural Comparison One cannot determine the similarity of two remotely homologous proteins by sequence comparison. Instead, we try to superimpose one protein structure over another in order to obtain the minimum root mean square deviation (RMSD) between them. Complexity: O(n^4 m^4).
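To make the RMSD measure concrete, here is a minimal Python sketch. It assumes the two structures are already superimposed and that residues are paired by index; it does not search for the optimal superposition or correspondence, which is the expensive part the slide refers to. The function name `rmsd` is illustrative and not taken from the presentation.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumption: point i of coords_a corresponds to point i of coords_b,
    and both sets are already in the same coordinate frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between corresponding points.
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```

A structural comparison method would evaluate this quantity over many candidate superpositions and keep the smallest value.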

The ProtDex Method Step 1: Extracting information from the PDB database. Step 2: Building intra-molecular distance matrices (design rationale: two protein structures are similar if their distance matrices are similar). Step 3: Cutting fixed-size matrices and extracting properties. Step 4: Building the inverted file index.

Step 1: Extracting Information For each protein chain in a PDB file: PDB id and chain id; number of AA residues; number of SSEs. For each AA residue: 3D coordinate (x, y, z) of the Cα carbon. For each SSE: SSE type (helix or sheet); SSE start position; SSE length.

Step 2: Representation - Building Distance Matrices Protein 9xxxx with 7 AA residues
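Step 2 can be sketched in a few lines of Python. This is an illustration only, assuming one 3D coordinate per residue as extracted in Step 1; the function name `distance_matrix` and the sample coordinates are made up for the example.

```python
import math

def distance_matrix(residue_coords):
    """Intra-molecular distance matrix for one protein chain.

    residue_coords: list of (x, y, z) tuples, one per AA residue.
    Cell [i][j] holds the Euclidean distance between residues i and j,
    so the matrix is symmetric with zeros on the diagonal.
    """
    n = len(residue_coords)
    return [
        [math.dist(residue_coords[i], residue_coords[j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical 3-residue chain, not real PDB data.
example = distance_matrix([(0, 0, 0), (3, 4, 0), (3, 4, 12)])
```

Because the matrix depends only on distances within the chain, it does not change when the structure is rotated or translated, which is what makes it usable for comparison without superposition.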

Step 3-1: Contact Patterns & Fixed-Size Matrices [Figure: SSE(H)-SSE(E) contact patterns; fixed-size matrix]

Step 3-2: Extracting Properties For the 2×2 sub-matrix starting at cell (2, 2), we store the values: 8, HH, (3,3), (1,1), (1,1). For the 2×2 sub-matrix starting at cell (3, 6), we store the values: 49, HE, (3,2), (1,2), (2,1), and so on.

Step 4: Building Inverted File Index Implemented as sorted list
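An inverted file index kept as a sorted list might look like the following Python sketch. The slide does not specify the term layout, so the `(contact type, value)` terms, the chain ids, and the match-count filter below are assumptions made for illustration; they are not the ProtDex scoring scheme.

```python
import bisect
from collections import Counter

class InvertedIndex:
    """Inverted index stored as one sorted list of (term, chain_id) postings."""

    def __init__(self):
        self._postings = []

    def add(self, term, chain_id):
        # insort keeps the list sorted, so lookups can use binary search.
        bisect.insort(self._postings, (term, chain_id))

    def lookup(self, term):
        """Return the chain ids whose postings carry exactly this term."""
        # (term,) sorts before every (term, chain_id), so this finds the
        # first posting for the term, if any.
        i = bisect.bisect_left(self._postings, (term,))
        hits = []
        while i < len(self._postings) and self._postings[i][0] == term:
            hits.append(self._postings[i][1])
            i += 1
        return hits

def candidates(index, query_terms, k):
    """Filter step (assumed): rank chains by number of matching query terms."""
    counts = Counter()
    for term in query_terms:
        counts.update(index.lookup(term))
    return counts.most_common(k)

# Hypothetical terms and chain ids.
idx = InvertedIndex()
idx.add(("HH", 8), "1abc-A")
idx.add(("HE", 49), "1abc-A")
idx.add(("HH", 8), "2xyz-B")
```

Only chains that share at least one term with the query are ever touched, which is where the speed advantage over scanning every structure comes from.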

Searching a Protein Structure $S(Q,P) = W_{\mathrm{FMCount}}(Q,P) \times W_{\mathrm{GSum}}(i,j) \times \sum_{\mathrm{match}(i,j)} \left[ W_{\mathrm{Term}}(i) \times \max_{\mathrm{match}(a,b) \,\wedge\, \mathrm{PdbId}_b = P} \left( W_{\mathrm{Area}}(a,b) \times W_{\mathrm{ARatio}}(a,b) \times W_{\mathrm{Ordinal}}(a,b) \right) \right]$ $W_{\mathrm{FMCount}}$ compensates for large proteins being matched and scored more frequently than small ones. $W_{\mathrm{Term}}$ gives more weight to query index terms that rarely occur in the database.

Discussion Design: representation of structures; scoring schemes; comparison algorithms; assessment of the results. Performance. Accuracy: the SCOP classification hierarchy is made of 4 levels: class, fold, superfamily and family. Pros and cons of ProtDex.

Conclusions Advantages: speed (no need to scan through each structure in the database). Disadvantages: cannot provide the actual alignment; storage overhead for the index structure (the entire index: 1.2 GB); time required to build and update the index (building the entire index: 30 min 38 sec).
