Data and Knowledge Engineering Laboratory
Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences
Minkoo Seo, Sanghyun Park, JungIm Won
{mkseo, sanghyun,
Department of Computer Science, Yonsei University
Outline
Introduction
Related Works
Basic Concept
–Segment
–Cluster and Look Ahead
Selectivity Analysis
Algorithm Description
–Exact Match
–Range Match
Experiments
Conclusion and Future Work
Introduction
Similarities in Protein Structure
–The functional properties of a protein usually depend on its structure.
–Many proteins are structurally similar even though their sequences are not similar at the amino acid level.
[Figure: protein structure hierarchy with the labels Protein, Polymer Chain, Secondary Structure Element (SSE): Helices and Sheets, Loop Regions, and Amino Acids (AA); one level is marked "Our target".]
Introduction (cont'd)
Secondary Structure of Protein Sequences
–A linear sequence of amino acids folds into a three-dimensional structure.
–There are three basic types of folds:
  h : Alpha Helices
  e : Beta Sheets
  l : Turns or Loops
–Normally, these types occur in groups. For example, 'lhhheeeelll' is more likely to occur than 'helhehhll'.
Primary    : GQISDSIEEKRGFFSTKR..
Secondary  : HLLLLLLLLLLHHHEEEE..
Probability:
Related Works
BLAST and FASTA
–These exhaustive search techniques are often parallelized or adapted to run on specialized high-end hardware.
–They do not solve the fundamental problem: the query still has to be compared with every sequence in the collection.
Related Works (cont'd)
Segment Based Indexing
L. Hammel and J. M. Patel, "Searching on the Secondary Structure of Protein Sequences," In Proc. VLDB Conference, 2002
–The indexing scheme is based on segmenting the secondary structure of protein sequences.
–Index selectivity is not uniform.
–Gaps can be specified only in units of segments.
Segment

Sample Protein Sequence
Protein ID : 1
Primary    : GQISDSIEEKRGFFSTKR..
Secondary  : HLLLLLLLLLLHHHEEEE..
Probability:

Segmentation Process
Secondary : H    LLLLLLLLLL  HHH   EEEE
Type, Len : H,1  L,10        H,3   E,4
Loc       : 0    1           11    14

Segment table
ID  Type  Length  Start Loc.
1   H     1       0
1   L     10      1
1   H     3       11
1   E     4       14
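The segmentation step on this slide amounts to run-length encoding the secondary structure string. The following is a minimal sketch, not the authors' implementation; the names `Segment` and `segment_sequence` are hypothetical and chosen only for illustration.

```python
from itertools import groupby
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One row of the segment table (field names are illustrative)."""
    protein_id: int
    sse_type: str
    length: int
    start_loc: int


def segment_sequence(protein_id: int, secondary: str) -> List[Segment]:
    """Split a secondary structure string into maximal runs of one type."""
    segments = []
    loc = 0
    for sse_type, run in groupby(secondary):
        length = len(list(run))
        segments.append(Segment(protein_id, sse_type, length, loc))
        loc += length  # the next segment starts where this one ends
    return segments


if __name__ == "__main__":
    # The sample sequence from the slide: H, ten L, three H, four E.
    sample = "H" + "L" * 10 + "HHH" + "EEEE"
    for row in segment_sequence(1, sample):
        print(row)
```

Run on the sample sequence, this yields the four rows of the segment table shown on the slide.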
Segment (cont'd)
Selectivity of Type+Length
–Index selectivity is not uniform.
Cluster and Look Ahead

Segment table
ID  Type  Length  Start Loc.
1   h     1       0
1   l     10      1
1   h     3       11
1   e     4       14

Cluster table, k=0
ID  Type Str  Length  Start Loc.  Look Ahead
1   h         1       0           lhe
1   l         10      1           he
1   h         3       11          e
1   e         4       14

Cluster table, k=1
ID  Type Str  Length  Start Loc.  Look Ahead
1   hl        11      0           he
1   lh        13      1           e
1   he        7       11

Cluster table, k=2
ID  Type Str  Length  Start Loc.  Look Ahead
1   hlhe      18      0

–2^k segments are combined and put into the Cluster table. By doing this, we index several characters instead of exactly one character.
–The 'Look Ahead' field contains the 'Type' fields of the next n segments.
–The maximum value of k can be limited according to the query length.
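The cluster tables on this slide can be reproduced by sliding a window of 2^k consecutive segments over the segment table. The sketch below is an illustration under assumptions: the names `Cluster` and `build_clusters` are hypothetical, and the look-ahead width `n_look_ahead=3` is a guess based on the longest Look Ahead value on the slide (the slide only calls it n).

```python
from typing import List, NamedTuple, Tuple

# One segment as (type, length, start_loc); the protein ID is passed separately.
SegmentRow = Tuple[str, int, int]


class Cluster(NamedTuple):
    """One row of a cluster table (field names are illustrative)."""
    protein_id: int
    type_str: str
    length: int
    start_loc: int
    look_ahead: str


def build_clusters(protein_id: int, segments: List[SegmentRow], k: int,
                   n_look_ahead: int = 3) -> List[Cluster]:
    """Combine every run of 2**k consecutive segments into one cluster row."""
    width = 2 ** k
    clusters = []
    for i in range(len(segments) - width + 1):
        window = segments[i:i + width]
        following = segments[i + width:i + width + n_look_ahead]
        clusters.append(Cluster(
            protein_id,
            "".join(t for t, _, _ in window),        # Type Str
            sum(length for _, length, _ in window),  # total length of the window
            window[0][2],                            # start of the first segment
            "".join(t for t, _, _ in following),     # types of the next segments
        ))
    return clusters


if __name__ == "__main__":
    segments = [("h", 1, 0), ("l", 10, 1), ("h", 3, 11), ("e", 4, 14)]
    for k in (0, 1, 2):
        print("k =", k)
        for row in build_clusters(1, segments, k):
            print("  ", row)
```

Note that only the summed length is stored per cluster, so the individual segment lengths inside a cluster cannot be recovered from the cluster table alone.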
Index Selectivity
–The cluster table is searched by TypeStr+TypeLen+LookAhead.
–The segment table is searched by Type+TypeLen.
–Cluster indices are 30 to 3,000 times more selective than the segment index, although the exact length of each segment is lost.
Number of proteins = 320,000
LookAhead Selectivity
–LookAhead improves selectivity by a factor of 2 to 34.
Number of proteins = 320,000
Exact Match
Query
Compute k
Clustering
  ① k=2
  ② k=2
Searching
  ① TypeStr=hlhe, TypeLen= , LookAhead=l
  ② TypeStr=lhel, TypeLen=
Sort Merge
–ID must be equal.
–Loc difference must be 1 (= ).
Post Processing
–Validate the result using the actual sequences.
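The exact match pipeline above (clustering, searching, merging, post processing) can be sketched in memory as follows. This is an illustration under stated assumptions, not the authors' implementation: `cluster_rows` and `exact_match` are hypothetical names, k is passed in because the slide does not say how it is computed, the query is assumed to contain between 2^k and 2^(k+1) segments so that two clusters cover it, the look-ahead width of 3 is assumed, the segment lengths in the usage example are made up (the slide's TypeLen values are not recoverable), and a set lookup stands in for the sort merge.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Query = List[Tuple[str, int]]  # [(type, length), ...]


def cluster_rows(pid: int, secondary: str, k: int, n_ahead: int = 3):
    """Yield (pid, type_str, type_len, start_loc, look_ahead) cluster rows."""
    runs = [(t, len(list(g))) for t, g in groupby(secondary)]
    locs, loc = [], 0
    for _, n in runs:
        locs.append(loc)
        loc += n
    w = 2 ** k
    for i in range(len(runs) - w + 1):
        window, ahead = runs[i:i + w], runs[i + w:i + w + n_ahead]
        yield (pid, "".join(t for t, _ in window), sum(n for _, n in window),
               locs[i], "".join(t for t, _ in ahead))


def exact_match(proteins: Dict[int, str], query: Query, k: int,
                n_ahead: int = 3) -> List[Tuple[int, int]]:
    """Return (protein_id, start_loc) pairs where the query occurs exactly."""
    w = 2 ** k
    if not w <= len(query) <= 2 * w:
        raise ValueError("sketch assumes 2**k <= number of query segments <= 2**(k+1)")
    # Clustering: the first and the last 2**k query segments.
    first, last = query[:w], query[-w:]
    offset = sum(n for _, n in query[:len(query) - w])  # expected Loc difference
    ahead = "".join(t for t, _ in query[w:w + n_ahead])
    table = [row for pid, s in proteins.items() for row in cluster_rows(pid, s, k, n_ahead)]

    def key(segs):
        return "".join(t for t, _ in segs), sum(n for _, n in segs)

    # Searching: cluster 1 uses TypeStr+TypeLen+LookAhead, cluster 2 TypeStr+TypeLen.
    hits1 = sorted((pid, loc) for pid, ts, tl, loc, la in table
                   if (ts, tl) == key(first) and la.startswith(ahead))
    hits2 = {(pid, loc) for pid, ts, tl, loc, _ in table if (ts, tl) == key(last)}
    # Merge: same ID, and the two clusters start `offset` positions apart.
    candidates = [(pid, loc) for pid, loc in hits1 if (pid, loc + offset) in hits2]
    # Post processing: cluster lengths are sums, so check the actual sequence.
    pattern = "".join(t * n for t, n in query)
    return [(pid, loc) for pid, loc in candidates if proteins[pid].startswith(pattern, loc)]


if __name__ == "__main__":
    # Made-up toy data: protein 2 has the same cluster lengths but different segments.
    proteins = {
        1: "h" + "l" * 10 + "hhh" + "eeee" + "ll" + "h",
        2: "h" + "l" * 9 + "hhhh" + "eeee" + "ll" + "h",
    }
    query = [("h", 1), ("l", 10), ("h", 3), ("e", 4), ("l", 2)]
    print(exact_match(proteins, query, k=2))
```

In the toy data, both proteins survive the index search and the merge because their cluster lengths agree; only the check against the actual sequence removes protein 2, which is why the post-processing step is needed.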
Range Match
Clustering
  TypeStr=elhe, TypeLen=64( )~77( )
Searching
–The range of TypeLen is increased too much.
Clustering
  ① k=2, TypeLen=64~77
  ② k=1, TypeLen=53~63
  ③ k=1, TypeLen=8~9
  ④ k=1, TypeLen=11~14
Comparing Histograms
–Search the cluster that is expected to return the smallest number of rows.
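The histogram comparison above can be sketched as a row-count estimate per candidate cluster followed by taking the minimum. This is only an illustration: `Candidate`, `estimate_rows` and `pick_cluster` are hypothetical names, the histogram layout and all counts are made up, and the TypeStr values for candidates ② to ④ (el, lh, he) are an assumption inferred from the fact that the ranges of ② and ④ add up to the range of ①.

```python
from typing import Dict, List, NamedTuple, Tuple

# Assumed layout: {(k, type_str): {type_len: number_of_rows}}
Histogram = Dict[Tuple[int, str], Dict[int, int]]


class Candidate(NamedTuple):
    """One candidate cluster search with a TypeLen range (illustrative)."""
    label: str
    k: int
    type_str: str
    len_lo: int
    len_hi: int


def estimate_rows(hist: Histogram, c: Candidate) -> int:
    """Estimate how many rows a range search on this cluster would return."""
    counts = hist.get((c.k, c.type_str), {})
    return sum(n for length, n in counts.items() if c.len_lo <= length <= c.len_hi)


def pick_cluster(hist: Histogram, candidates: List[Candidate]) -> Candidate:
    """Choose the candidate expected to return the fewest rows."""
    return min(candidates, key=lambda c: estimate_rows(hist, c))


if __name__ == "__main__":
    # Ranges are taken from the slide; TypeStr for 2-4 and all counts are assumed.
    candidates = [
        Candidate("1", 2, "elhe", 64, 77),
        Candidate("2", 1, "el", 53, 63),
        Candidate("3", 1, "lh", 8, 9),
        Candidate("4", 1, "he", 11, 14),
    ]
    hist = {
        (2, "elhe"): {64: 5, 70: 40, 77: 3, 90: 100},
        (1, "el"): {53: 20, 60: 30, 70: 9},
        (1, "lh"): {8: 4, 9: 2, 10: 50},
        (1, "he"): {11: 10, 14: 15},
    }
    for c in candidates:
        print(c.label, c.type_str, estimate_rows(hist, c))
    print("search cluster", pick_cluster(hist, candidates).label)
```

The rows returned by the chosen cluster would still have to be checked against the remaining query segments, which this sketch does not cover.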
Environment
Oracle Database on Linux
–Stores the cluster table and cluster indices.
JDK 1.4.2_04 on Windows XP
–Runs the searching program.
Protein Data Used
–80,000 amino acid sequences were downloaded from PDB (Protein Data Bank).
–The secondary structures of the protein sequences were predicted using PREDATOR.
–320,000 sequences were populated by duplicating the 80,000 sequences 4 times.
Exact Match
–Our method is 3 to 20 times faster than MISS(2) and 8 to 49 times faster than SSS and ISS.
Range Match
[Figure panels: Range Match, First; Range Match, Middle; Range Match, End]
–SCRM is 1.2 to 6 times faster than MISS(2).
Scalability
[Figure panels: Exact Match, |Q|=5; Range Match (Middle), |Q|=5]
–Our methods are 2 to 4 times faster than MISS(2).
Index Size
–Index size increases linearly.
–The maximum value of k can be set according to the expected length of queries.
–In most cases, a maximum k of 2 will suffice.
Number of proteins = 320,000
Conclusion and Future Work
–We have proposed the concept of clustered indexing, which indexes a fixed number of characters, and look ahead, which enhances index selectivity.
–Our experiments show that the proposed techniques are more efficient than the previous algorithms.
–In the future, we would like to study and propose approximate searching algorithms for the 3-D structures of proteins.
Q & A