Published by Jocelyn Edwards. Modified over 9 years ago.
1
A data-mining approach for multiple structural alignment of proteins
WY Siu, N Mamoulis, SM Yiu, HL Chan
The University of Hong Kong
Sep 9, 2009
2
Aligning protein structures
Structural alignment is an important step towards understanding protein functions.
  Sequencing proteins and determining 3D structures is comparatively easy (X-ray crystallography, NMR spectroscopy).
  Testing the functions of proteins is hard.
One useful observation
  Mutations change sequences, but structures are conserved.
  Structural similarity => functional similarity
  Good structural alignment algorithm => predict functions of proteins
3
Our focus
We propose studying the problem with less information and fewer assumptions.
Sequence order independence
  The sequence order (arrangement of amino acids) is treated as unknown.
  This reduces the need to find sequence information.
Subset alignment
  Find a large alignment for every subset of the input structures.
  Extract similar structures from a mixture that also contains dissimilar ones.
Bottleneck metric
  In an alignment, every pair of aligned points has a small distance.
4
Related work
Pairwise alignment
  Dali [Holm & Sander 93]
  VAST [Gibrat, Madej, & Bryant 96]
  CE [Shindyalov & Bourne 98]
Techniques to obtain multiple alignments
  Center-star [Akutsu & Kim 99]
  Tree-progress [Taylor, Flores, & Orengo 94]
Multiple alignment
  MultiProt [Shatsky et al. 02] (seq. order)
  MultiBind [Shatsky et al. 06] (all align.)
  MUSTA [Leibowitz et al. 01] (all align.)
  MASS [Dror et al. 03] (seq. order)
  POSA [Ye & Godzik 05] (seq. order)
5
In the following
  Model
  Algorithms
    SOIL
  Experimental results
6
Model
A protein is a set of amino acids in 3D, and each amino acid is represented by 3 points in space: its Cα atom, C, and N.
A substructure is a subset of the amino acids.
Transformation T(S)
  For each s ∈ S, T(s) = Rs + t, where R is a 3 × 3 rotation matrix and t is a 3 × 1 translation vector.
Similarity
  Let C = {c1, …, cn} be a set of substructures and T = {T1, …, Tn} a set of transformations.
  C is ε-congruent w.r.t. T if we can transform each structure in C and align the amino acids such that the Cα atoms of every aligned pair are close (distance ≤ ε).
An ε-congruent alignment
  For a set S of structures, an alignment is a set of substructures C and transformations T such that C is ε-congruent w.r.t. T.
[Figure: structures S1 and S2 being rotated and translated.]
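The transformation T(s) = Rs + t and the ε-congruence test can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the function names `apply_transform` and `is_eps_congruent` and the data layout (one Cα point per structure in each aligned column) are hypothetical and do not come from the talk.

```python
import math

def apply_transform(R, t, p):
    """Return R*p + t for a 3x3 matrix R (list of rows) and 3-vectors t and p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def is_eps_congruent(aligned_columns, transforms, eps):
    """Check the bottleneck condition of an alignment.

    aligned_columns: list of columns; each column holds one C-alpha point
                     per structure (assumed layout, for illustration only).
    transforms:      list of (R, t) pairs, one per structure.
    Returns True if, after transforming, every pair of aligned points
    is at distance <= eps.
    """
    for column in aligned_columns:
        moved = [apply_transform(R, t, p) for (R, t), p in zip(transforms, column)]
        for a in range(len(moved)):
            for b in range(a + 1, len(moved)):
                if math.dist(moved[a], moved[b]) > eps:
                    return False
    return True
```

Because the test takes the maximum over all aligned pairs instead of an average, a single badly placed pair is enough to reject an alignment, which is what the bottleneck metric asks for.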
7
Problem definition
Size of an alignment: the number of aligned amino acids for each protein.
Cardinality: the number of structures involved.
Input
  A set of structures S = {S1, S2, …, Sm}
  A distance threshold ε
  A subset size threshold min_cardinality
  An alignment length threshold min_size
Output
  For each subset S' ⊆ S with |S'| ≥ min_cardinality, the maximal-length ε-congruent alignment whose length is at least min_size
8
The SOIL Algorithm
SOIL: Sequence Order Independent aLignment
  Step 1. Geometric hashing
  Step 2. Frequent pattern mining
  Step 3. Generating alignments
9
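Geometric Hashing
Purpose: take each amino acid as a base (reference) and store the relative locations of the other amino acids in a hash table.
  Each hash box has side length ε.
  The base is stored in every box that one of the other amino acids falls into.
[Figure: structures S1 and S2, each with amino acids numbered 1 to 5, placed on a grid of boxes.]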
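A minimal sketch of this hashing step in Python follows. The talk only states that the local reference frame uses three atoms. The particular frame built here (origin at Cα, x-axis towards C, z-axis normal to the Cα-C-N plane) is an assumption made for illustration, as are the names `local_frame` and `build_hash_table`.

```python
import math
from collections import defaultdict

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def local_frame(ca, c, n):
    """Assumed frame: origin at C-alpha, x towards C, z normal to the CA-C-N plane."""
    x = unit(sub(c, ca))
    z = unit(cross(x, sub(n, ca)))
    y = cross(z, x)
    return ca, (x, y, z)

def build_hash_table(structures, eps):
    """structures: list of structures, each a list of residues (ca, c, n).

    For every base (k, i), express the other C-alpha atoms of structure k in
    the base's local frame and store the base in the box (cube of side eps)
    that each of them falls into.
    """
    table = defaultdict(set)
    for k, residues in enumerate(structures):
        for i, (ca, c, n) in enumerate(residues):
            origin, axes = local_frame(ca, c, n)
            for j, (other_ca, _, _) in enumerate(residues):
                if j == i:
                    continue  # the base itself always sits at the origin
                rel = sub(other_ca, origin)
                coords = tuple(dot(rel, axis) for axis in axes)
                box = tuple(math.floor(v / eps) for v in coords)
                table[box].add((k, i))
    return table
```

Since the coordinates are taken relative to the base's own frame, a rigidly moved copy of a structure fills the same boxes as the original, up to points that lie near a box boundary.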
10
Mining Frequent Patterns
Main observation. Assume that a pair of bases {(k1, i1), (k2, i2)} appears together in x boxes. Then, if structures S_k1 and S_k2 are transformed using the bases (k1, i1) and (k2, i2), i.e., amino acid i1 of S_k1 and amino acid i2 of S_k2, at least x + 1 pairs of points are located close to each other (distance at most √3 ε, i.e., the diagonal length of a box): one pair for each shared box, plus the two bases themselves.
Proof. Why is (k1, i1) in a box? Because when S_k1 is transformed using the base (k1, i1), one of its amino acids falls into that box.
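The √3 ε bound follows directly from the box geometry. A short derivation, where p and q denote two transformed points that fall into the same box and a denotes the corner of that box (all three symbols are introduced here for illustration):

```latex
\[
p, q \in [a_x, a_x+\varepsilon) \times [a_y, a_y+\varepsilon) \times [a_z, a_z+\varepsilon)
\;\Longrightarrow\;
|p_x - q_x|,\; |p_y - q_y|,\; |p_z - q_z| < \varepsilon
\]
\[
\|p - q\| = \sqrt{(p_x-q_x)^2 + (p_y-q_y)^2 + (p_z-q_z)^2}
\;\le\; \sqrt{3\varepsilon^2} = \sqrt{3}\,\varepsilon .
\]
```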
11
Mining Frequent Patterns
Let each hash box be a coincidence group, or transaction.
Consider all bases as items.
Find all sets of items that appear together frequently in the coincidence groups.
  This is the "frequent pattern mining problem", a well-studied problem in the database area.
  Efficient algorithms, such as those based on the FP-tree, are known.
  The approach is efficient because it considers all possible transformations at the same time.
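To make the transaction/item view concrete, here is a small level-wise (Apriori-style) miner in Python. It is a stand-in chosen for brevity, not the FP-tree approach mentioned on the slide, and the name `frequent_itemsets` and its parameters `min_support` and `max_size` are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise (Apriori-style) frequent itemset mining.

    transactions: iterable of sets of items; here each transaction would be
                  the set of bases (k, i) stored in one hash box.
    Returns a dict mapping each frequent itemset (a frozenset) to its support,
    i.e. the number of transactions that contain it.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    size = 1
    while current and size < max_size:
        size += 1
        items = sorted(set().union(*current))
        # Keep a candidate only if all of its (size-1)-subsets are frequent.
        candidates = [
            frozenset(c) for c in combinations(items, size)
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        result.update(current)
    return result
```

With the hash table from the geometric hashing step, `frequent_itemsets(table.values(), min_support)` would return the groups of bases that share many boxes. Each such group suggests one simultaneous transformation of the structures involved.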
12
Generating Alignments
Given a frequent pattern, e.g., (S_1^2, S_2^1):
  Use the bases in the tuple to transform the structures involved.
  Generate a matching of points: bipartite matching for pairwise alignments, greedy matching for multiple alignments.
  Output the largest alignment.
[Figure: transformed S1 and S3 in the x-y plane, with the resulting alignment (S1, S3): (5, 1), (1, 5), (2, 4), (3, 3), (4, 2).]
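For the pairwise case, the matching step can be sketched as a maximum bipartite matching over all point pairs that lie within ε of each other after transformation. The augmenting-path method used below is one standard way to compute such a matching. The talk does not say which matching algorithm was used, and the function name `max_matching_within_eps` is hypothetical.

```python
import math

def max_matching_within_eps(points_a, points_b, eps):
    """Maximum bipartite matching between two already-transformed point sets.

    An edge joins point i of A and point j of B when their distance is <= eps.
    Returns a sorted list of matched index pairs (i, j).
    """
    neighbours = [
        [j for j, q in enumerate(points_b) if math.dist(p, q) <= eps]
        for p in points_a
    ]
    match_of_b = [None] * len(points_b)  # index in A matched to each point of B

    def try_augment(i, visited):
        # Try to match point i of A, re-routing earlier matches if needed.
        for j in neighbours[i]:
            if j in visited:
                continue
            visited.add(j)
            if match_of_b[j] is None or try_augment(match_of_b[j], visited):
                match_of_b[j] = i
                return True
        return False

    for i in range(len(points_a)):
        try_augment(i, set())
    return sorted((i, j) for j, i in enumerate(match_of_b) if i is not None)
```

The number of pairs returned is the size of the alignment under that transformation. Repeating this for every frequent pattern and keeping the largest result corresponds to the "output the largest alignment" step.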
13
Experimental evaluation
Implemented in C++.
Test cases were run on an Intel Core 2 Duo with a 2.66 GHz CPU and 4 GB of main memory.
Default settings
  ε: 3 Å
  min_size: 2
  LRF: 3Atoms
  Coincidence group: Bin
  max_trans: 30Avg
14
Pairwise alignment
10 pairs of proteins used in earlier work (e.g., MultiProt), from SCOP and PRINT families.
Comparison of running time
  C-alpha match: within a few seconds (from web)
  MultiProt: 0.211 s
  MultiBind: 1.968 s
  SOIL: 0.235 s
15
Multiple alignments
10 groups of proteins: various superfamilies in SCOP and protein interfaces from PRINT.
16
Multiple alignments
[Figure: results per level for each protein group. Panels: Calcium Binding, 4-helix Bundle, Superhelix, Supersandwich, Concanavalin, tRNA synthetase, G-proteins, PTB domain, PRINT 45, PRINT 8158. Axis: levels.]
17
Multiple alignment
18
Conclusion
Proposed a more difficult problem
  Sequence order independence: modeled as the largest common point set problem
  Subset alignment: automatically detect subsets of similar structures
  Similarity measurement: adopt the bottleneck metric
Developed the SOIL algorithm
  Combination of geometric hashing and frequent itemset mining
  Simultaneous alignment
Evaluated the algorithm with experiments
  Can be combined with other methods by simply taking the maximum.
19
Future Work
  Variations of the problem
  Scoring functions
  Disk-based solution
  Other applications