2
Agenda A brief introduction The MASS algorithm The pairwise case Extension to the multiple case Experimental results
3
Introduction
4
Importance Protein analysis: protein classification; detecting functional units that share similar geometrical configurations. Applications to: docking, protein engineering, and drug design (pharmacophore searching).
5
The LCP Problem Given a collection of M point-sets in 3D space, find the largest common subset. This is known as the Largest Common Point-set (LCP) problem. The LCP problem is NP-hard, so existing solutions rely on heuristics.
6
The Multiple Alignment by Secondary Structures (MASS) Algorithm
7
Motivation The MUSTA algorithm (Leibowitz, Fligelman, Nussinov, and Wolfson 1999): a truly multiple-based approach. Desired improvements: efficiency; finding partial solutions, i.e., alignments between a subset of the input molecules.
8
Partial Alignments [Figure: molecules with substructures labeled A, B, and C] Two types of partial alignments: B & C
9
General Strategy Pivot scheme. Based on a two-level alignment: local secondary structure superposition, then global atomic superposition. Geometric hashing paradigm.
10
Why Secondary Structure? Stability: Secondary structures are conserved during evolution Robustness: Proteins are dense molecules Efficiency: Introduces great savings in structural description
11
The Pairwise Case Outline: (1) SSE assignment; (2) SSE representation; (3) detection of seed matches; (4) clustering the seed matches; (5) global extension & refinement. [Figure labels: SSE Representation, Atomic Representation]
12
Step 1: SSE assignment The proteins are represented by their secondary structure elements.
13
Secondary Structure Element (SSE) Helix: alpha (abundant), 3₁₀ (infrequent), π (rare). Strand (abundant).
14
Secondary Structure Assignment PDB (Bernstein et al. 1977), DSSP (Kabsch & Sander 1983), DSSPCont (Andersen et al. 2002), STICK (Taylor 2001)
15
Step 2: SSE representation An SSE is represented by a 3D line segment with fuzzy endpoints. Helix representation: [figure not preserved]
16
Strand representation: least-squares line [Figure labels: N-terminus, C-terminus]
17
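The SSE least-squares line minimizes the sum of squared distances $\sum_i d_i^2$, where $d_i$ is the distance from Cα atom $i$, at $(x_i, y_i)$, to the line.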
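A least-squares line of this kind can be sketched with a singular value decomposition. This is an illustration only, not the MASS implementation: it assumes NumPy, takes the distances d_i as perpendicular distances in 3D, and the function names `fit_sse_line` and `sum_squared_distances` are hypothetical.

```python
import numpy as np

def fit_sse_line(ca_coords):
    """Fit a 3D least-squares line to C-alpha coordinates.

    Returns (centroid, direction). The line passes through the centroid
    along the unit direction that minimizes the sum of squared
    perpendicular distances (assumed meaning of d_i).
    """
    pts = np.asarray(ca_coords, dtype=float)
    centroid = pts.mean(axis=0)
    # The first right-singular vector of the centered coordinates is the
    # direction of largest variance, which is the best-fit line direction.
    _, _, vt = np.linalg.svd(pts - centroid)
    direction = vt[0]
    # Orient the direction from the first atom to the last atom.
    if np.dot(pts[-1] - pts[0], direction) < 0:
        direction = -direction
    return centroid, direction

def sum_squared_distances(ca_coords, centroid, direction):
    """Sum of squared perpendicular distances from the atoms to the line."""
    pts = np.asarray(ca_coords, dtype=float) - centroid
    along = pts @ direction                      # signed position along the line
    perp = pts - np.outer(along, direction)      # perpendicular offsets
    return float((perp ** 2).sum())
```

Calling `fit_sse_line` on the Cα coordinates of one strand gives a point and a direction; projecting the first and last atoms onto that line would give the (fuzzy) segment endpoints.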
18
Step 3: detection of seed matches A base is a pair of SSEs. Find bases whose configuration appears in both proteins. A base configuration is represented by a fingerprint.
19
A base fingerprint is a 5D vector composed of: the types of the two SSEs (helix or strand), the line distance, the midpoint distance, and the angle.
20
[Figure: the midpoint distance and the line distance between the two SSEs of a base]
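The fingerprint components can be illustrated with a small sketch. The exact definitions used by MASS are not given on the slides, so the following are assumptions: the line distance is the shortest distance between the two infinite lines, the midpoint distance is measured between segment midpoints, and the angle is taken between the segment directions. The function name `fingerprint` and the `(type, start, end)` tuple layout are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def fingerprint(sse1, sse2):
    """Return (type1, type2, line_distance, midpoint_distance, angle).

    Each SSE is a (type, start, end) tuple with 3D endpoints.
    """
    t1, a1, b1 = sse1
    t2, a2, b2 = sse2
    d1, d2 = _sub(b1, a1), _sub(b2, a2)
    m1 = tuple((x + y) / 2 for x, y in zip(a1, b1))
    m2 = tuple((x + y) / 2 for x, y in zip(a2, b2))
    w = _sub(m2, m1)
    n = _cross(d1, d2)
    if _norm(n) > 1e-9:
        # Skew or intersecting lines: project w on the common perpendicular.
        line_dist = abs(_dot(w, n)) / _norm(n)
    else:
        # Parallel lines: distance from a point of line 2 to line 1.
        line_dist = _norm(_cross(w, d1)) / _norm(d1)
    midpoint_dist = _norm(w)
    cos_angle = _dot(d1, d2) / (_norm(d1) * _norm(d2))
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    return (t1, t2, line_dist, midpoint_dist, angle)
```

All three numeric components depend only on relative geometry, so translating or rotating both SSEs together leaves the fingerprint unchanged.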
21
The fingerprint is invariant to 3D rigid transformations. Bases with similar fingerprints can be aligned in different ways: axis system superposition, midpoint-to-midpoint alignment, or RMSD minimization.
22
Axis system superposition: Define an axis system on each base. [Figure labels: SSE 1, SSE 2, X-Axis, Y-Axis, Z-Axis]
23
Superimpose the axis systems of matched bases. [Figure labels: X-Axis, Y-Axis, Z-Axis]
24
Based on the assumption: the line distance segments are conserved. Pros: no use of the SSE lengths and endpoints. Cons: the assumption is not always correct. Pathological example in 2D: [Figure labels: d, d=0]
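One possible way to build and superimpose such axis systems is sketched below. The slides define the axes only in a figure, so the construction here is an assumption: the origin is the foot of the line-distance segment on SSE 1, X runs along SSE 1, Z runs along the line-distance segment, and Y completes a right-handed frame. The names `base_frame` and `frame_superposition` are hypothetical, and NumPy is assumed.

```python
import numpy as np

def base_frame(p1, d1, p2, d2):
    """Build an orthonormal frame on a base given two SSE lines.

    Each line is a point p and a direction d. Returns (origin, axes),
    where the rows of axes are the X, Y, and Z unit vectors.
    """
    p1, d1, p2, d2 = (np.asarray(v, dtype=float) for v in (p1, d1, p2, d2))
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    b = d1 @ d2
    if 1.0 - b * b < 1e-9:
        raise ValueError("parallel SSE lines: frame is not uniquely defined")
    w = p1 - p2
    # Parameters of the closest points on line 1 and line 2.
    s = (b * (d2 @ w) - (d1 @ w)) / (1.0 - b * b)
    u = ((d2 @ w) - b * (d1 @ w)) / (1.0 - b * b)
    origin = p1 + s * d1
    z = (p2 + u * d2) - origin          # the line-distance segment
    if np.linalg.norm(z) < 1e-9:
        z = np.cross(d1, d2)            # lines intersect (d = 0): fall back
    z = z / np.linalg.norm(z)
    x = d1
    y = np.cross(z, x)                  # right-handed: x cross y equals z
    return origin, np.stack([x, y, z])

def frame_superposition(frame_a, frame_b):
    """Rigid transformation (R, t) that maps frame_a onto frame_b."""
    (oa, fa), (ob, fb) = frame_a, frame_b
    rot = fb.T @ fa                     # sends each axis of a to the same axis of b
    return rot, ob - rot @ oa
```

The fallback for intersecting lines mirrors the weakness noted on the slide: when the line distance is close to zero, the Z direction is poorly determined and small coordinate changes can flip the frame.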
25
Midpoint-to-midpoint alignment: Align the middle Cα atoms of the matched SSEs, then expand the alignment to both sides.
26
Based on the assumptions: SSE endpoints are fuzzy; SSE midpoints are conserved. Pros: simplicity. Cons: the SSE midpoints are not always conserved, and DSSP sometimes splits an SSE in two.
27
RMSD minimization: Iterate over all the possible atomic alignments between the matched SSEs. Choose the alignment that minimizes the RMSD.
28
Pros: no assumptions. Cons: may converge to a local minimum instead of the global one.
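The RMSD-minimization option can be sketched as an exhaustive search over ungapped windows, each scored by a least-squares superposition. This is a simplified illustration rather than the MASS procedure: it assumes NumPy, uses the standard Kabsch method for the superposition, restricts "all possible alignments" to fixed-length ungapped windows, and the names `kabsch` and `best_base_alignment` are hypothetical.

```python
import itertools
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid superposition of P onto Q.

    Returns (R, t, rmsd) such that P @ R.T + t approximates Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - pc).T @ (Q - qc)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    t = qc - R @ pc
    diff = P @ R.T + t - Q
    return R, t, float(np.sqrt((diff ** 2).sum() / len(P)))

def best_base_alignment(base_a, base_b, length):
    """Try every ungapped window of `length` atoms in each matched SSE.

    base_a and base_b are pairs of C-alpha coordinate arrays (one array
    per SSE). Returns (rmsd, offsets, R, t) for the lowest RMSD found.
    """
    (a1, a2), (b1, b2) = base_a, base_b
    best = None
    for i, j, k, l in itertools.product(
            range(len(a1) - length + 1), range(len(b1) - length + 1),
            range(len(a2) - length + 1), range(len(b2) - length + 1)):
        P = np.vstack([a1[i:i + length], a2[k:k + length]])
        Q = np.vstack([b1[j:j + length], b2[l:l + length]])
        R, t, rmsd = kabsch(P, Q)
        if best is None or rmsd < best[0]:
            best = (rmsd, (i, j, k, l), R, t)
    return best
```

The search cost grows with the product of the four window counts, which is one reason a cheaper superposition such as the axis-system or midpoint method may be preferred for long SSEs.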
29
To find congruent bases efficiently: all bases are stored in a geometric hash (GH) according to their fingerprint.
30
Bases that reside in the same hash bin or in adjacent bins are congruent. [Figure: 2D cut of the hash; ε is the tolerance]
31
For each hash bin: Retrieve all the bases in the bin and in the adjacent bins. Insert the bases into a combinatorial bucket, with one column per protein. Two bases from different columns define a seed match. [Figure: Protein 1 column with 3 bases, Protein 2 column with 2 bases: 3 x 2 seed matches]
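The hashing and bucket steps above can be sketched with a dictionary keyed by quantized fingerprints. This is a minimal sketch under stated assumptions: a single tolerance `eps` is used as the bin width for all three numeric components (distances and angle), the SSE types must match exactly, and the names `bin_key`, `build_hash`, `neighbours`, and `seed_matches` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations, product

def bin_key(fp, eps):
    """Quantize a fingerprint (type1, type2, v1, v2, v3) into a hash key."""
    t1, t2, *values = fp
    return (t1, t2) + tuple(math.floor(v / eps) for v in values)

def build_hash(bases, eps):
    """bases: iterable of (protein_id, base_id, fingerprint)."""
    table = defaultdict(list)
    for protein, base, fp in bases:
        table[bin_key(fp, eps)].append((protein, base))
    return table

def neighbours(key):
    """Yield the key itself and all adjacent bins (same SSE types)."""
    types, cell = key[:2], key[2:]
    for delta in product((-1, 0, 1), repeat=len(cell)):
        yield types + tuple(c + d for c, d in zip(cell, delta))

def seed_matches(table):
    """Collect seed matches: pairs of bases from different proteins."""
    matches = set()
    for key in list(table):
        # Combinatorial bucket: one column of bases per protein.
        bucket = defaultdict(set)
        for nk in neighbours(key):
            for protein, base in table.get(nk, ()):
                bucket[protein].add(base)
        for p, q in combinations(sorted(bucket), 2):
            for a, b in product(bucket[p], bucket[q]):
                matches.add(((p, a), (q, b)))
    return matches
```

With three congruent bases from one protein and two from another, the bucket yields 3 x 2 = 6 seed matches, as in the slide's example. Using a set removes the duplicates that arise when the same pair is reached from neighbouring bins.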
32
Step 4: clustering the seed matches Detect matches with similar transformations and join them into clusters, using RMSD clustering: similar to (Rarey 1996); works in an iterative manner.
33
[Figure: clustering example with transformations T1 to T6]
34
Step 5: global extension & refinement For each match: Apply its transformation. Find corresponding atoms that lie close enough to each other after the superposition. Use least-squares fitting to refine the transformation. Iterate until the RMSD converges.
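The extension and refinement loop above can be sketched as follows. This is an illustration, not the MASS code: it assumes NumPy, a greedy one-to-one nearest-neighbour correspondence, and a hypothetical distance cutoff of 3.0 (in the same units as the coordinates); the names `kabsch` and `extend_and_refine` and all default parameter values are assumptions.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid superposition; P @ R.T + t approximates Q."""
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - pc).T @ (Q - qc))
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    t = qc - R @ pc
    diff = P @ R.T + t - Q
    return R, t, float(np.sqrt((diff ** 2).sum() / len(P)))

def extend_and_refine(A, B, R, t, cutoff=3.0, max_iter=20, tol=1e-6):
    """Iteratively extend a seed transformation to a global atomic match.

    A, B: (n, 3) and (m, 3) atom coordinates. R, t: seed transformation
    mapping A onto B. Returns (R, t, pairs, rmsd).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    pairs, rmsd, previous = [], None, None
    for _ in range(max_iter):
        moved = A @ R.T + t                      # apply the transformation
        dists = np.linalg.norm(moved[:, None, :] - B[None, :, :], axis=2)
        pairs, used = [], set()
        # Greedy matching: atoms with the closest partner choose first.
        for i in np.argsort(dists.min(axis=1)):
            j = int(np.argmin(dists[i]))
            if dists[i, j] <= cutoff and j not in used:
                pairs.append((int(i), j))
                used.add(j)
        if len(pairs) < 3:
            break                                # too few atoms to fit
        ia, ib = zip(*pairs)
        R, t, rmsd = kabsch(A[list(ia)], B[list(ib)])
        if previous is not None and abs(previous - rmsd) < tol:
            break                                # RMSD has converged
        previous = rmsd
    return R, t, pairs, rmsd
```

A bipartite matching could replace the greedy correspondence step; the greedy version is used here only to keep the sketch short.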
35
The Multiple Case Outline: (1) SSE assignment & representation; (2) detection of seed matches; (3) clustering the seed pairwise matches; (4) global extension of pairwise matches; (5) computing multiple matches; (6) refinement; (7) selecting high-scoring multiple matches.
36
Finding bases whose configuration appears in a sufficient number of molecules: All bases are stored in a geometric hash according to their fingerprint. Bases that reside in the same bin or in adjacent bins are congruent.
37
For each hash bin: Retrieve all the bases in the bin and in the adjacent bins. Insert them into a combinatorial bucket (CB). [Figure: CB columns for Protein i, Protein j, Protein k, Protein r, Protein s, with i<j<k<r<s]
38
Construct pairwise seed matches; the reference protein is the one with the smaller index. Cluster the pairwise matches. Globally extend the pairwise matches.
39
Recursively construct multiple alignments. [Figure: Protein i, Protein j, Protein k, Protein r, Protein s, with i<j<k<r<s]
40
Refinement. Selecting high-scoring multiple matches: The score of a multiple match with n proteins and k atoms is given by: [formula not preserved]. Example: n = 3, k = 4, score = 12 (consistent with score = n × k).
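Selecting the high-scoring matches can be sketched as a simple ranking. The scoring formula itself did not survive on the slide; the product n × k is assumed here only because it reproduces the slide's example (n = 3, k = 4, score = 12). The names `match_score` and `select_high_scoring` and the tuple layout are hypothetical.

```python
def match_score(n_proteins, n_atoms):
    """Assumed score: number of proteins times number of matched atoms."""
    return n_proteins * n_atoms

def select_high_scoring(matches, keep=2):
    """Rank multiple matches by score and keep the best ones.

    matches: list of (label, n_proteins, n_atoms) tuples.
    """
    ranked = sorted(matches,
                    key=lambda m: match_score(m[1], m[2]),
                    reverse=True)
    return ranked[:keep]
```

A score of this form trades off the two goals named earlier in the talk: a partial alignment covering fewer proteins can still rank highly if it shares many more atoms.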
41
Experimental Results
42
MASS vs. MUSTA
43
Partial Solutions
44
All-alpha Class The core shared by ten proteins. The proteins belong to 4 different folds of the all-alpha class.
46
TIM-barrel Fold The core shared by 6 out of 7 proteins, taken from different superfamilies of the TIM-barrel fold.
47
Calcium Binding The core of 6 proteins belonging to 3 different families of the EF-hand-like superfamily.
48
Lipase Family An alignment of four structures from different species of the Lipase Family. Two of the conformations are open and two of them are closed.