Published by Shannon Cox. Modified over 9 years ago.
1
EECS 730: Introduction to Bioinformatics
Structure Comparison
Luke Huan, Electrical Engineering and Computer Science
http://people.eecs.ku.edu/~jhuan/
2
Protein Structure Similarity
3
2015-10-5, EECS 730
Secondary Structure Elements: helices, strands/sheets, and loops
4
Structure Prediction/Determination
- Computational tools
  - Homology, threading
  - Molecular dynamics
- Experimental tools
  - NMR spectroscopy
  - X-ray crystallography
5
The State of the Structure Space
- 1990: 250 new structures
- 1999: 2,500 new structures
- 2000: >20,000 structures total
- 2004: ~30,000 structures total
- Structures have been determined for only about 10% of known protein sequences
- Protein Structure Initiative (PSI)
6
Structure Similarity
- Refers to how well (or poorly) the 3D folded structures of proteins can be aligned
- Expected to reflect functional similarities (interaction with other molecules)
- [Figure: proteins in the TIM barrel fold family]
7
Alignment of 1xis and 1nar (TIM Barrels)
- Alignment computed by DALI
- [Figure: 1xis and 1nar in ribbon format and backbone format, with helix axes]
- Source: Sayle, R. RasMol: a protein visualization tool. http://www.umass.edu/microbio/rasmol/index2.htm
8
Structure Similarity
- Refers to how well (or poorly) the 3D folded structures of proteins can be aligned
- Is expected to reflect functional similarities (interaction with other molecules)
- 2007: ~34,000 structures in the PDB, ~1,000 different folds (1:34 ratio)
11
Structure Similarity
- Refers to how well (or poorly) the 3D folded structures of proteins can be aligned
- Is expected to reflect functional similarities (interaction with other molecules)
- 2000: ~20,000 structures in the PDB, ~4,000 different folds (1:5 ratio)
- Three possible reasons why folds are so few relative to structures:
  - evolution
  - physical constraints (e.g., few ways to maximize hydrophobic interactions)
  - limits in the techniques used for structure determination
- Given a new structure, the probability is high that it is similar to an existing one
12
Why Compute Structure Similarity?
- Sequence → Structure → Function [diagram label: sequence similarity]
- Low sequence similarity may yield very similar structures
- Sometimes high sequence similarity yields different structures
13
Alignment of 1xis and 1nar (TIM Barrels)
- 1xis and 1nar have only 7% sequence identity, but approximately 70% of the residues are structurally similar
14
Why Compute Structure Similarity?
- Sequence → Structure → Function [diagram labels: sequence similarity, structure similarity]
- Low sequence similarity may yield very similar structures
- Sometimes high sequence similarity yields different structures
- Structure comparison is expected to provide more pertinent information about functional (dis-)similarity among proteins, especially when the relationship is non-evolutionary or the evolutionary relationship is not detectable
15
Ill-Posed Problem, Multiple Terminology
- (Dis-)similarity analysis
- Structure comparison
- Alignment, superposition, matching
- Classification
Outline:
- Definitions
- Applications
- Methods
- Issues
16
A Few Web Sites
- Protein Data Bank (PDB): http://www.rcsb.org/pdb/
- Protein classification:
  - SCOP: http://scop.berkeley.edu/
  - CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
- Protein alignment:
  - DALI: http://www.ebi.ac.uk/dali/
  - LOCK: http://motif.stanford.edu/lock2/
17
3D Molecular Structure
- Collection of (possibly typed) atoms or groups of atoms in some given 3D relative placement
- The placement of a group of atoms is defined by the position of a reference point (e.g., the center of an atom) and the orientation of a reference direction
- The type can be the atom ID, the amino-acid ID, etc.
18
Matching of Structures
Two structures A and B match if:
1. Correspondence: there is a one-to-one map between their elements
2. Alignment: there exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold
19
Complete Match
20
Alignment of 3adk and 1gky
- But a complete match is rarely possible:
  - The molecules have different sizes
  - Their shapes are only locally similar
- [Figure: both matching and non-matching secondary structure elements]
21
Partial Match
- Notion of support σ of the match: the match is between σ(A) and σ(B)
- Dual problem:
  - What is the support?
  - What is the transform?
- Often several (many) possible supports
- Small supports → motifs
22
Mathematical Relative
- [Figure: two functions f and g, compared by $\|f - g\|_2$ over a support s]
- Over which support?
24
Application #1: Find Global Similarities Among Protein Structures
- Given two protein structures, find the largest similar substructures
- For example, a substructure is a subset of Cα atoms or a subset of secondary structure elements in each molecule
- Several possible similarity measures
- Variants: 1-to-1, 1-to-many, many-to-many (PDB)
- Must be automatic (and fast)
26
Application #2: Classify Proteins
- Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997]
- Hierarchical classification
  - Class: similar secondary structure content
  - Fold: SSEs in similar arrangement
  - Family: clear evolutionary relationship
- Insight into functions and structure stabilization
- Basis for homology and threading
- Manual classification: SCOP [Murzin et al., 1995]
- Increasing size of the PDB → automatic classifiers: CATH [Orengo et al., 1997]; Pclass [Singh et al.]; FSSP [Holm and Sander]
27
Manual vs. Automatic Classification
28
Application #3: Find Motif in Protein Structure
- Given a protein structure and a motif (e.g., a small collection of atoms corresponding to a binding site), find whether the motif matches a substructure of the protein
- Variant: one motif against many proteins
- [Figure: active sites of 1PIP and 5PAD. Only 3 amino acids participate in the motif]
29
Application #4: Find Pharmacophore
- Given:
  - A small collection (5-10) of small flexible ligands with similar activity (hence, assumed to bind at the same protein site)
  - Low-energy conformations (several dozen to a few hundred) for each ligand
- Find a substructure (pharmacophore) that occurs in at least one conformation of each ligand
- Key problem in drug design when the binding site is unknown
30
Application #4: Find Pharmacophore
- Inhibitors of thermolysin: 1TLP, 4TMN, 5TMN, 6TMN
- [Figure: clusters of low-energy conformations of 1TLP]
- [Figure: the 4 ligands overlapped with their pharmacophore matched]
31
Application #5: Search for Ligands Containing a Pharmacophore
- Given:
  - A database containing several hundred thousand, or more, small ligands
  - A pharmacophore P
- Find all ligands that have a low-energy conformation containing P
- Data mining of pharmaceutical databases (lead generation)
- Reference: S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C. Latombe. A Randomized Kinematics-Based Approach to Pharmacophore-Constrained Conformational Search and Database Screening. J. of Computational Chemistry, 21(9):731-747, July 2000.
32
Outline:
- Definitions
- Applications
- Methods
- Issues
33
Multiple Partial Matches
34
Distributed Support
- [Figure: structures A and B with supports σ(A) and σ(B); the support is interrupted by a gap]
35
What is Best?
- [Figure: alternative matches between A and B]
- Should gaps be penalized?
36
What About This?
- [Figure: a match between A and B]
- The sequence along the backbone is not preserved
37
For partial matches, the similarity measure is unlikely to satisfy the triangle inequality
38
Compute Structure Similarity
- Structure representation
- Similarity measurement
- Computational solution
39
Structure Representation
- Element-based representation: a structure is broken down into a list of structure elements
- We represent a protein structure by its geometry, topology, and attributes:
  - Geometry: the coordinates of the elements
  - Topology: the physical and chemical interactions between elements
  - Attributes: the physical and chemical attributes of the elements
40
Structure Representation
There are three major groups of structure representation:
- Point list: treat the protein as a list of points in 3D space
- Point set: treat the protein as a set of points in 3D space
- Graphs: treat the protein as a graph
41
Comparing Two Point Sets
- Similarity measure: given two point sets $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_m\}$ with $n \le m$, find a Euclidean transformation $T$ (rotation + translation) and a 1-1 mapping $f$ from $P$ to $Q$ such that
  $$S(P, Q) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d^2\bigl(p_i, T(f(p_i))\bigr)}$$
  is minimized
- $S$ is called the RMSD (root-mean-square deviation) between the two structures
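To make the measure concrete, the following minimal sketch evaluates S for one given transformation and one given mapping; it does not search for the minimizing T and f. The helper names `rmsd` and `rigid_2d`, the representation of f as a list of indices into Q, and the 2D sample points are all assumptions made for illustration.

```python
import math

def rmsd(P, Q, f, T):
    """RMSD between each p_i and T applied to its partner Q[f[i]]."""
    total = sum(math.dist(p, T(Q[f[i]])) ** 2 for i, p in enumerate(P))
    return math.sqrt(total / len(P))

def rigid_2d(theta, tx, ty):
    """Return a 2D rotation by theta followed by a translation (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return lambda q: (c * q[0] - s * q[1] + tx, s * q[0] + c * q[1] + ty)

# Hypothetical data: n = 3 points in P, m = 4 points in Q.
P = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
Q = [(5.0, 5.0), (0.0, 1.0), (1.0, 1.0), (0.0, 3.0)]
f = [1, 2, 3]                   # p_i is paired with Q[f[i]]
T = rigid_2d(0.0, 0.0, -1.0)    # pure translation by (0, -1)

print(rmsd(P, Q, f, T))         # the paired points coincide after T
```

With the identity transform instead of T, every pair is one unit apart, so the same function returns 1.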
42
Comparing Two Point Sets
- If m = n and the correspondence between the points is known, there is a closed-form solution that finds the optimal transformation exactly
- If m ≠ n, the problem is much harder
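As an illustration of what a closed-form solution looks like, here is a sketch for the 2D case with a known pairing (A[i] corresponds to B[i]). This is a simplification chosen to stay dependency-free, not the method of the slides: in 3D the analogous optimum is usually computed with an SVD-based or quaternion-based procedure. The function name `superpose_2d` and the sample points are assumptions.

```python
import math

def superpose_2d(A, B):
    """Least-squares rotation angle and RMSD for superposing B onto A (paired, 2D)."""
    n = len(A)
    # The optimal translation aligns the centroids, so center both sets first.
    ca = (sum(x for x, _ in A) / n, sum(y for _, y in A) / n)
    cb = (sum(x for x, _ in B) / n, sum(y for _, y in B) / n)
    A0 = [(x - ca[0], y - ca[1]) for x, y in A]
    B0 = [(x - cb[0], y - cb[1]) for x, y in B]
    # The optimal angle maximizes sum(a . R(theta) b), which has a closed form.
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(A0, B0))
    cross = sum(ay * bx - ax * by for (ax, ay), (bx, by) in zip(A0, B0))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    sq = sum((ax - (c * bx - s * by)) ** 2 + (ay - (s * bx + c * by)) ** 2
             for (ax, ay), (bx, by) in zip(A0, B0))
    return theta, math.sqrt(sq / n)

# Hypothetical test: B is A rotated by 30 degrees and shifted by (4, -1).
A = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
phi = math.radians(30)
B = [(math.cos(phi) * x - math.sin(phi) * y + 4.0,
      math.sin(phi) * x + math.cos(phi) * y - 1.0) for x, y in A]
theta, err = superpose_2d(A, B)
print(math.degrees(theta), err)
```

The call should report an angle close to -30 degrees (the rotation that undoes the one applied to A) and an RMSD close to zero.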
43
Common Point Subset Problem
- Find the largest common point subset
- Given two point sets $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_m\}$ with $n \le m$, find a Euclidean transformation $T$ (rotation + translation) and a 1-1 partial mapping $f$ from $P$ to $Q$ of maximal cardinality such that $d(p_i, T(f(p_i))) < t$ for every $i$ on which $f$ is defined
- Also a harder problem (but not an NP-hard one)
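The sketch below covers only half of this problem: it assumes a candidate transformation has already been applied to Q and computes a maximum-cardinality 1-1 partial mapping under the threshold t, using a standard augmenting-path bipartite matching. Searching over transformations is not shown. The function name `largest_common_subset` and the sample points are assumptions.

```python
import math

def largest_common_subset(P, Q, t):
    """Maximum 1-1 partial mapping P -> Q with d(p, q) < t (Q already transformed)."""
    # near[i] lists the indices of Q that point i of P may be paired with.
    near = [[j for j, q in enumerate(Q) if math.dist(p, q) < t] for p in P]
    owner = {}  # index in Q -> index in P currently matched to it

    def augment(i, seen):
        """Try to match P[i], re-routing earlier matches when necessary."""
        for j in near[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in owner or augment(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i in range(len(P)):
        augment(i, set())
    return {i: j for j, i in owner.items()}

# Hypothetical data where taking the first candidate for P[0] would block P[1].
P = [(0.0, 0.0), (1.0, 0.0)]
Q = [(0.5, 0.0), (-0.4, 0.0)]
print(largest_common_subset(P, Q, t=0.6))   # both points of P get a partner
```

A purely greedy pairing could stop at one pair here; the augmenting step re-routes P[0] to its second candidate so that P[1] can also be matched.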
44
Geometric Hashing
- Originally used for automatic visual recognition of geometric figures
- The principle: we have two geometric figures
  - Model A with m points (there can be several models)
  - Query B with n points
- Discover similar subfigures in A and B, invariant under translation, rotation (and often scaling)
- Let the figures be described by points
- Try to find the largest set of point pairs from (A, B) that coincide
45
Coinciding Points
- Example in two dimensions
- Six overlapping pairs can be found: (1,a), (2,d), (3,c), (4,e), (6,f), (7,g)
- The coinciding pairs are independent of the labeling
- Note that the figures can be translated and rotated
46
Reference Frames
- The points of the figures are specified in coordinate systems, or reference frames
- In 2D, a reference frame can be defined by two points
- Choose two points from A ($a_i$, $a_k$) and two from B ($b_j$, $b_l$), called bases, and define the reference frames (RF) from the bases
- Example: origin at $a_i$ and the x-axis along the line $a_i a_k$, or origin at the midpoint of $a_i a_k$
- Find the positions in the RF of all the other points; the result is called a reference frame system (RFS)
- "Overlap" RFS A and RFS B (align their x- and y-axes) and count the number of coinciding points
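The construction of a reference frame system can be sketched as follows, assuming the first convention above (origin at the first basis point, x-axis toward the second) and no rescaling. The function name `to_frame` and the sample figure are assumptions.

```python
import math

def to_frame(points, i, k):
    """Coordinates of all points in the frame with origin points[i], x-axis toward points[k]."""
    ox, oy = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    norm = math.hypot(dx, dy)
    ux, uy = dx / norm, dy / norm    # unit x-axis; the y-axis is (-uy, ux)
    return [((x - ox) * ux + (y - oy) * uy,      # projection on the x-axis
             -(x - ox) * uy + (y - oy) * ux)     # projection on the y-axis
            for x, y in points]

# Hypothetical figure, and the same figure rotated by 90 degrees and shifted by (3, 4).
figure = [(0.0, 0.0), (6.0, 2.0), (8.0, 0.0)]
moved = [(-y + 3.0, x + 4.0) for x, y in figure]

print(to_frame(figure, 0, 2))
print(to_frame(moved, 0, 2))     # same coordinates: the frame removes the placement
```

Because both calls use the same basis pair, the two frame systems agree even though the raw coordinates differ, which is the invariance the method relies on.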
47
Reference Frame System, Example
- Model (1,3): [(0,0), (6,2), (8,0), (9,4), (6,10), (3,8), (-1,6)]
- Query (a,c): [(0,0), (3,-2), (8,0), (6,2), (10,4), (3,8), (0,6)]
- Model (3,5): [(0,0), (1,8), (2,2), (4,-2), (10,0), (8,3), (8,7)]
- Model (1,3) and query (a,c): four coinciding points
- Model (3,5) and query (a,c): only the origins coincide
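The counts in this example can be checked with a few lines of code. The sketch below uses the coordinates listed on the slide; the function name `coinciding` and the tolerance of 0.5 per coordinate are assumptions, since the slides do not state how close two points must be to count as coinciding.

```python
def coinciding(rfs_a, rfs_b, tol=0.5):
    """Count points of rfs_a that have a point of rfs_b within tol in both coordinates."""
    return sum(
        any(abs(ax - bx) <= tol and abs(ay - by) <= tol for bx, by in rfs_b)
        for ax, ay in rfs_a
    )

model_13 = [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)]
model_35 = [(0, 0), (1, 8), (2, 2), (4, -2), (10, 0), (8, 3), (8, 7)]
query_ac = [(0, 0), (3, -2), (8, 0), (6, 2), (10, 4), (3, 8), (0, 6)]

print(coinciding(model_13, query_ac))   # 4
print(coinciding(model_35, query_ac))   # 1 (only the origins)
```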
48
Comparison of (Reference) Frame Systems
- The number of coinciding points depends on the bases
- All possible pairs should therefore be tried as bases
- This would result in m(m-1)n(n-1) comparisons of reference frame systems, but many of those comparisons are redundant
- Geometric hashing is used to perform many comparisons "simultaneously" and efficiently
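A quick way to see where m(m-1)n(n-1) comes from is to enumerate the ordered basis pairs explicitly. The sketch below does this for figures of 7 points each, the size used in the running example; the function name `basis_pair_count` is an assumption.

```python
from itertools import permutations

def basis_pair_count(m, n):
    """Number of (model basis, query basis) combinations when bases are ordered pairs."""
    model_bases = list(permutations(range(m), 2))   # m(m-1) ordered pairs
    query_bases = list(permutations(range(n), 2))   # n(n-1) ordered pairs
    return len(model_bases) * len(query_bases)

print(basis_pair_count(7, 7))   # 42 * 42 = 1764
```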
49
Hashing
- Compare a query frame system simultaneously to all model frame systems
- Assume a 2D hash table H and a very simple hash function: one bucket for each square of the frame system, identified by (p,q)
- Let (u,v) ∈ H(p,q) mean that the frame system with basis (u,v) has a point in the square (p,q)
- H is filled in a preprocessing of the model
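The preprocessing step might look like the sketch below, which fills H from model frame systems. The unit cell size, the use of `floor` to pick a square, the function name `build_table`, and the dictionary layout are assumptions; the model coordinates are the frame systems (1,3) and (3,5) from the earlier example.

```python
import math
from collections import defaultdict

def build_table(frame_systems, cell=1.0):
    """Map each square (p, q) to the set of bases whose frame system has a point in it."""
    H = defaultdict(set)
    for basis, points in frame_systems.items():
        for x, y in points:
            H[(math.floor(x / cell), math.floor(y / cell))].add(basis)
    return H

models = {
    (1, 3): [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)],
    (3, 5): [(0, 0), (1, 8), (2, 2), (4, -2), (10, 0), (8, 3), (8, 7)],
}
H = build_table(models)
print(sorted(H[(0, 0)]))   # [(1, 3), (3, 5)]: both frame systems contain the origin
print(sorted(H[(8, 0)]))   # [(1, 3)]
```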
50
Hashing Preprocessing Example
51
Recognition
- Compare the query with the model (several frame systems)
- Select a basis in the query and define its reference frame
- Find the positions in the reference frame of all the other points
- For each point r in the query reference system:
  - Calculate the position (x,y) in H that r hashes to
  - Cast one vote for each model reference system in H(x,y)
- Recognize the model reference systems with the highest votes
- Repeat for more query reference systems if not enough coinciding points are found
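The voting loop above can be sketched as follows for a single query basis. The unit-square bucketing and the helper names `bucket`, `build_table`, and `vote` are assumptions; the model frame systems and the query (a,c) are the ones given in the example slides.

```python
import math
from collections import Counter, defaultdict

def bucket(x, y, cell=1.0):
    """Square of the frame system that the point (x, y) falls into."""
    return (math.floor(x / cell), math.floor(y / cell))

def build_table(frame_systems):
    """Preprocessing: record, per square, which model bases have a point there."""
    H = defaultdict(set)
    for basis, points in frame_systems.items():
        for x, y in points:
            H[bucket(x, y)].add(basis)
    return H

def vote(H, query_points):
    """Each query point votes for every model basis stored in its square."""
    votes = Counter()
    for x, y in query_points:
        for basis in H.get(bucket(x, y), ()):
            votes[basis] += 1
    return votes

models = {
    (1, 3): [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)],
    (3, 5): [(0, 0), (1, 8), (2, 2), (4, -2), (10, 0), (8, 3), (8, 7)],
}
query_ac = [(0, 0), (3, -2), (8, 0), (6, 2), (10, 4), (3, 8), (0, 6)]

votes = vote(build_table(models), query_ac)
print(votes.most_common())   # [((1, 3), 4), ((3, 5), 1)]
```

Model basis (1,3) collects four votes and (3,5) only one, matching the coinciding-point counts stated for this query.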
52
Example Recognition
- Query (a,c): [(0,0), (3,-2), (8,0), (6,2), (10,4), (3,8), (0,6)]
53
Use of Several Models
- Several models can be stored in the same hash table
- Each entry in the hash table must then identify both the model and the reference system
- Example: a database of structures, stored in one hash table
54
Geometric Hashing for Structure Comparison
- Need methods invariant under translation and rotation
- Use geometric hashing to find subsets of coincident residues (points), i.e., residues that superpose well
1. Define reference frames. Three atoms can be used: $a_i$, $a_k$, $a_r$. Example:
  - Origin at $a_i$
  - The x-axis along $a_i a_k$
  - The y-axis in the plane defined by $a_i$, $a_k$, $a_r$, in the counterclockwise direction
  - The z-axis orthogonal to the plane
2. The residues may have labels (attributes)
  - Implement the labels explicitly: in the hash table
  - Implement the labels implicitly: in the hashing
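A 3D reference frame built from three atoms, as in step 1, can be sketched as below. The interpretation that the y-axis points toward the side of $a_r$ is an assumption about the "counterclockwise" convention, and the helper names (`sub`, `dot`, `cross`, `unit`, `frame_3d`, `to_frame_3d`) and sample coordinates are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def frame_3d(ai, ak, ar):
    """Origin at ai, x-axis toward ak, y-axis in the plane of the three atoms, z-axis normal."""
    x = unit(sub(ak, ai))
    z = unit(cross(x, sub(ar, ai)))   # orthogonal to the plane of ai, ak, ar
    y = cross(z, x)                   # in the plane, on the same side as ar
    return ai, (x, y, z)

def to_frame_3d(p, frame):
    """Coordinates of point p in the given reference frame."""
    origin, axes = frame
    d = sub(p, origin)
    return tuple(dot(d, axis) for axis in axes)

# Hypothetical atoms and a fourth point, then the same four points
# rotated by 90 degrees about the z-axis and shifted by (1, 1, 1).
frame_a = frame_3d((0, 0, 0), (2, 0, 0), (1, 1, 0))
frame_b = frame_3d((1, 1, 1), (1, 3, 1), (0, 2, 1))
print(to_frame_3d((1, 2, 3), frame_a))
print(to_frame_3d((-1, 2, 4), frame_b))   # same frame coordinates as the line above
```

The three basis atoms must not be collinear, otherwise the cross product is zero and the frame is undefined.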