A data-mining approach for multiple structural alignment of proteins WY Siu, N Mamoulis, SM Yiu, HL Chan The University of Hong Kong Sep 9, 2009.

Presentation transcript:

1 A data-mining approach for multiple structural alignment of proteins WY Siu, N Mamoulis, SM Yiu, HL Chan The University of Hong Kong Sep 9, 2009

2 Aligning protein structures
- Important step for understanding protein functions
- Sequencing proteins and determining 3D structures is easy
  - X-ray crystallography, NMR spectroscopy
- Testing functions of proteins is hard
- One useful observation
  - Mutations change sequences
  - Structures are conserved
  - Structural similarity => Functional similarity
- Good structural alignment algorithm => Predict functions of proteins

3 Our focus
- We propose studying the problem with less information and fewer assumptions
- Sequence order independence
  - Sequence order (arrangement of amino acids) is unknown
  - Reduces the need to find sequence information
- Subset alignment
  - Find a large alignment for all subsets
  - Extract similar structures from a mixture with dissimilar ones
- Bottleneck metric
  - In an alignment, every pair of aligned points has a small distance

4 Related work
- Pairwise alignment
  - Dali [Holm & Sander 93]
  - VAST [Gibrat, Madej, & Bryant 96]
  - CE [Shindyalov & Bourne 98]
- Techniques to obtain multiple alignments
  - Center-star [Akutsu & Kim 99]
  - Tree-progress [Taylor, Flores, & Orengo 94]
- Multiple alignment
  - MultiProt [Shatsky et al. 02] (seq. order)
  - MultiBind [Shatsky et al. 06] (all align.)
  - MUSTA [Leibowitz et al. 01] (all align.)
  - MASS [Dror et al. 03] (seq. order)
  - POSA [Ye & Godzik 05] (seq. order)

5 In the following
- Model
- Algorithms
  - SOIL
- Experimental results

6 Model
- A protein is a set of amino acids in 3D; each amino acid is represented by 3 points in space
  - The Cα atom, C, and N
  - Substructure = subset of amino acids
- Transformation T(S)
  - For each s ∈ S, T(s) = Rs + t, where R is a 3 × 3 rotation matrix and t is a 3 × 1 translation vector
- Similarity
  - Let C = {c1, …, cn} be a set of substructures and T = {T1, …, Tn} a set of transformations
  - C is ε-congruent w.r.t. T if we can transform each structure in C and align the amino acids such that the Cα atoms of every aligned pair are close (distance ≤ ε)
- An ε-congruent alignment
  - For a set S of structures, an alignment is a set of substructures C and transformations T
[Figure: structures S1 and S2, with rotate and translate steps]
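The definitions on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles two structures, transforms only the first one (the second keeps the identity transformation), and takes the aligned pairs as given. The function names `transform` and `is_eps_congruent` are hypothetical and not taken from the presentation.

```python
import math

def transform(point, R, t):
    """Apply T(s) = R s + t to one 3D point (R: 3x3 nested list, t: length-3 list)."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))

def is_eps_congruent(a, b, R, t, pairs, eps):
    """Bottleneck test: every aligned pair (i, j) must satisfy |T(a[i]) - b[j]| <= eps."""
    return all(math.dist(transform(a[i], R, t), b[j]) <= eps for i, j in pairs)

# Rotation by 90 degrees about the z-axis, followed by a unit shift along x.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]
a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]   # C-alpha positions of structure 1
b = [(1.0, 1.0, 0.0), (0.0, 0.1, 0.0)]   # C-alpha positions of structure 2
pairs = [(0, 0), (1, 1)]                 # a[0] aligned with b[0], a[1] with b[1]

loose = is_eps_congruent(a, b, R, t, pairs, eps=0.5)
tight = is_eps_congruent(a, b, R, t, pairs, eps=0.05)
```

Because the bottleneck metric looks at the worst aligned pair, a single pair farther apart than ε is enough to reject the alignment, which is why the second check fails here.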

7 Problem definition
- Size of an alignment: number of aligned amino acids of each protein
- Cardinality: number of structures involved
- Input
  - A set of structures S = {S1, S2, …, Sm}
  - A distance threshold ε
  - A subset size threshold min_cardinality
  - An alignment length threshold min_size
- Output
  - For each subset S' ⊆ S with |S'| ≥ min_cardinality, the maximum-length ε-congruent alignment whose length is at least min_size

8 The SOIL Algorithm
- Sequence Order Independent aLignment
  - Step 1. Geometric hashing
  - Step 2. Frequent pattern mining
  - Step 3. Generating alignments

9 Geometric Hashing
- Purpose
  - Take each amino acid as a base (reference) and store the relative locations of the other amino acids in a hash table.
[Figure: structures S1 and S2 with amino acids numbered 1 to 5, overlaid on a grid of boxes; each box stores the base; length of box = ε]
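A minimal sketch of this hashing step, in Python, might look as follows. The slide does not say how the reference frame of a base is built; the construction below (origin at the Cα atom, axes derived from the C and N atoms) is an assumption, as are the names `local_frame` and `build_hash_table`.

```python
import math
from collections import defaultdict

def sub(u, v): return tuple(a - b for a, b in zip(u, v))
def dot(u, v): return sum(a * b for a, b in zip(u, v))
def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0])
def unit(u):
    n = math.sqrt(dot(u, u))
    return tuple(a / n for a in u)

def local_frame(ca, c, n):
    """Assumed frame: origin at C-alpha, x towards C, z normal to the (C, N) plane."""
    x = unit(sub(c, ca))
    z = unit(cross(x, sub(n, ca)))
    y = cross(z, x)
    return ca, (x, y, z)

def build_hash_table(structures, eps):
    """structures[k] is a list of residues (ca, c, n). Returns {box: set of bases},
    where base (k, i) means residue i of structure k was used as the reference."""
    table = defaultdict(set)
    for k, residues in enumerate(structures):
        for i, (ca, c, n) in enumerate(residues):
            origin, axes = local_frame(ca, c, n)
            for j, (other_ca, _, _) in enumerate(residues):
                if j == i:
                    continue
                rel = sub(other_ca, origin)
                coords = tuple(dot(rel, axis) for axis in axes)
                box = tuple(math.floor(v / eps) for v in coords)  # cube of side eps
                table[box].add((k, i))  # store the base in the box
    return table

def residue(ca):
    """Toy residue whose C and N atoms sit one unit along x and y from C-alpha."""
    return (ca, (ca[0] + 1.0, ca[1], ca[2]), (ca[0], ca[1] + 1.0, ca[2]))

s1 = [residue(p) for p in [(0.0, 0.0, 0.0), (3.5, 0.5, 0.5), (0.25, 4.25, 0.25)]]
s2 = [residue((x + 10.0, y + 20.0, z + 30.0)) for (x, y, z), _, _ in s1]  # shifted copy
table = build_hash_table([s1, s2], eps=1.0)
```

Since `s2` is a translated copy of `s1`, matching bases of the two structures end up in the same boxes, which is exactly the co-occurrence that the mining step looks for.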

10 Mining Frequent Patterns
- Main observation. Assume that a pair of bases {(k1, i1), (k2, i2)} appears in x boxes. Then, if structures S_k1 and S_k2 are transformed using base i1 of S_k1 and base i2 of S_k2, there are at least x + 1 pairs of points located close to each other (distance at most √3·ε, i.e., the diagonal length of a box): one pair per shared box, plus the two bases themselves.
- Proof. Why is (k1, i1) in a box? Because when S_k1 is transformed using base i1 of S_k1, an amino acid falls in that box.
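The √3·ε bound follows from the box geometry alone. As a short worked step, assuming axis-aligned cubic boxes of side ε and writing p and q for two transformed points that fall in the same box:

```latex
% p, q lie in the same axis-aligned cube of side \varepsilon,
% so each coordinate differs by at most \varepsilon.
\[
  |p_i - q_i| \le \varepsilon \quad (i = 1, 2, 3)
  \;\Longrightarrow\;
  \lVert p - q \rVert
  = \sqrt{\sum_{i=1}^{3} (p_i - q_i)^2}
  \le \sqrt{3\varepsilon^2}
  = \sqrt{3}\,\varepsilon .
\]
```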

11 Mining Frequent Patterns
- Let each hash box be a coincidence group, or transaction.
- Consider all bases as items.
- Find all sets of items that appear frequently in the coincidence groups.
- This is the "frequent pattern mining problem", a well-studied problem in the database area.
- Efficient algorithms, such as FP-tree, are known.
- Efficient: all possible transformations can be considered at the same time.
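To make the transaction view concrete, here is a small Python sketch that counts how often pairs of bases share a box. It is a plain pair-counting stand-in, not the FP-tree method mentioned on the slide, and it is limited to patterns of size two. The function name `frequent_base_pairs`, the `min_support` parameter, and the rule that skips pairs from the same structure are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_base_pairs(transactions, min_support):
    """Count, for every pair of bases, the number of transactions (hash boxes)
    containing both, and keep the pairs found in at least min_support boxes."""
    counts = Counter()
    for bases in transactions:
        for a, b in combinations(sorted(bases), 2):
            if a[0] != b[0]:  # assumption: only pair bases of different structures
                counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Each box lists the bases (structure index, amino-acid index) stored in it.
boxes = [
    {(0, 1), (1, 4)},
    {(0, 1), (1, 4), (2, 2)},
    {(0, 1), (1, 4), (0, 3)},
    {(0, 3), (2, 2)},
]
frequent = frequent_base_pairs(boxes, min_support=2)
```

In this toy input only the pair of bases (0, 1) and (1, 4) co-occurs often enough, so it would be the single candidate passed on to the alignment step.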

12 Generating Alignments
- Given a frequent pattern
  - E.g., (S_{1,2}, S_{2,1}), i.e., amino acid 2 of S1 and amino acid 1 of S2 as bases
  - Use the bases in the tuple to transform the structures involved
  - Generate a matching of points: bipartite matching for pairwise alignment, greedy for multiple alignment
  - Output the largest alignment
[Figure: transformed S1 and S3 in the x-y plane, with the resulting alignment]
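For the pairwise case, the matching step could be sketched as below. The slide only says "bipartite matching"; the augmenting-path method used here is one standard way to compute a maximum bipartite matching and is an assumption, as is the function name `max_matching`. The sketch also assumes both point sets have already been transformed into the common frame.

```python
import math

def max_matching(a, b, eps):
    """Maximum one-to-one matching between point lists a and b, where a[i] may be
    matched to b[j] only if their distance is at most eps (augmenting paths)."""
    candidates = [[j for j, q in enumerate(b) if math.dist(p, q) <= eps] for p in a]
    match_of_b = {}  # index in b -> index in a currently matched to it

    def try_assign(i, seen):
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take b[j] if it is free, or if its current partner can move elsewhere.
            if j not in match_of_b or try_assign(match_of_b[j], seen):
                match_of_b[j] = i
                return True
        return False

    for i in range(len(a)):
        try_assign(i, set())
    return sorted((i, j) for j, i in match_of_b.items())

a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (9.0, 0.0, 0.0)]
b = [(0.9, 0.0, 0.0), (-0.5, 0.0, 0.0)]
alignment = max_matching(a, b, eps=1.0)
```

In this example a first-come greedy pass would give `b[0]` to `a[0]` and leave `a[1]` unmatched; the augmenting step reassigns `a[0]` to `b[1]` so that both can be aligned.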

13 Experimental evaluation
- Implemented in C++
- Test cases run on an Intel Core 2 Duo CPU at 2.66 GHz with 4 GB main memory
- Default settings
  - ε: 3 Å
  - min_size: 2
  - LRF: 3Atoms
  - Coincidence group: Bin
  - max_trans: 30Avg

14 Pairwise alignment
- 10 pairs of proteins used in earlier work, e.g., MultiProt
- SCOP and PRINT families
- Comparison of running time
  - C-alpha match: within a few seconds (from web)
  - MultiProt: 0.211 s
  - MultiBind: 1.968 s
  - SOIL: 0.235 s

15 Multiple alignments
- 10 groups of proteins
- Various superfamilies in SCOP, protein interfaces from PRINT

16 Multiple alignments
[Figure: results per level (axis: Levels) for Calcium Binding, 4-helix Bundle, Superhelix, Supersandwich, Concanavalin, tRNA synthetase, G-proteins, PTB domain, PRINT 45, and PRINT 8158]

17 Multiple alignment

18 Conclusion
- Proposed a more difficult problem
  - Sequence order independence: modeled as the largest common point set problem
  - Subset alignment: automatically detect subsets of similar structures
  - Similarity measurement: adopt the bottleneck metric
- Developed the SOIL algorithm
  - Combination of geometric hashing and frequent itemset mining
  - Simultaneous alignment
- Evaluated the algorithm with experiments
  - Can be combined with other methods by simply taking the maximum

19 Future Work
- Variations of the problem
- Scoring functions
- Disk-based solution
- Other applications

