EECS 730 Introduction to Bioinformatics Structure Comparison Luke Huan Electrical Engineering and Computer Science


1 EECS 730 Introduction to Bioinformatics Structure Comparison Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/

2 Protein Structure Similarity

3 (2015-10-5, EECS 730) Secondary Structure Elements: α helices, β strands/sheets, and loops

4 Structure Prediction/Determination: Computational tools: homology, threading, molecular dynamics. Experimental tools: NMR spectroscopy, X-ray crystallography.

5 The State of the Structure Space: 1990: 250 new structures. 1999: 2500 new structures. 2000: >20,000 structures total. 2004: ~30,000 structures total. Only about 10% of structures have been determined for known protein sequences. Protein Structure Initiative (PSI).

6 Structure Similarity: Refers to how well (or poorly) 3D folded structures of proteins can be aligned. Expected to reflect functional similarities (interaction with other molecules). [Figure: proteins in the TIM barrel fold family.]

7 Alignment of 1xis and 1nar (TIM barrels): Alignment computed by DALI. [Figure: 1xis and 1nar with helix axes, shown in ribbon format and in backbone format.] Visualization: Sayle, R. RasMol. A protein visualization tool. http://www.umass.edu/microbio/rasmol/index2.htm

8 Structure Similarity: Refers to how well (or poorly) 3D folded structures of proteins can be aligned. Is expected to reflect functional similarities (interaction with other molecules). 2007: ~34,000 structures in PDB, ~1,000 different folds (1:34 ratio).


11 Structure Similarity: Refers to how well (or poorly) 3D folded structures of proteins can be aligned. Is expected to reflect functional similarities (interaction with other molecules). 2000: ~20,000 structures in PDB, ~4,000 different folds (1:5 ratio). Three possible reasons: evolution; physical constraints (e.g., few ways to maximize hydrophobic interactions); limits in the techniques used for structure determination. Given a new structure, the probability is high that it is similar to an existing one.

12 Why Compute Structure Similarity? [Diagram: Sequence → Structure → Function, with sequence similarity marked.] Low sequence similarity may yield very similar structures. Sometimes high sequence similarity yields different structures.

13 Alignment of 1xis and 1nar (TIM barrels): 1xis and 1nar have only 7% sequence identity, but approximately 70% of the residues are structurally similar.

14 Why Compute Structure Similarity? [Diagram: Sequence → Structure → Function, with sequence similarity and structure similarity marked.] Low sequence similarity may yield very similar structures. Sometimes high sequence similarity yields different structures. Structure comparison is expected to provide more pertinent information about functional (dis-)similarity among proteins, especially for non-evolutionary relationships or non-detectable evolutionary relationships.

15 Ill-Posed Problem: Multiple terminology: (dis-)similarity analysis; structure comparison; alignment, superposition, matching; classification. Outline: Definitions, Applications, Methods, Issues.

16 A Few Web Sites: Protein Data Bank (PDB): http://www.rcsb.org/pdb/ Protein classification: SCOP (http://scop.berkeley.edu/), CATH (http://www.biochem.ucl.ac.uk/bsm/cath/). Protein alignment: DALI (http://www.ebi.ac.uk/dali/), LOCK (http://motif.stanford.edu/lock2/).

17 3D Molecular Structure: A collection of (possibly typed) atoms or groups of atoms in some given 3D relative placement. The placement of a group of atoms is defined by the position of a reference point (e.g., the center of an atom) and the orientation of a reference direction. The type can be the atom ID, the amino-acid ID, etc.

18 Matching of Structures: Two structures A and B match if: 1. Correspondence: there is a one-to-one map between their elements. 2. Alignment: there exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold ε.

19 Complete Match

20 Alignment of 3adk and 1gky: But a complete match is rarely possible: the molecules have different sizes, and their shapes are only locally similar. [Figure: both matching and non-matching secondary structure elements.]

21 Partial Match: Notion of support σ of the match: the match is between σ(A) and σ(B). Dual problem: What is the support? What is the transform? Often several (many) possible supports. Small supports → motifs.

22 Mathematical Relative: comparing two functions f and g by $\|f - g\|^2$ (figure label: s). Over which support?


24 Application #1: Find Global Similarities Among Protein Structures: Given two protein structures, find the largest similar substructures. For example, a substructure is a subset of Cα atoms or a subset of secondary structure elements in each molecule. Several possible similarity measures. Variants: 1-to-1, 1-to-many, many-to-many (PDB). Must be automatic (and fast).

25 Application #2: Classify Proteins: Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997]. Hierarchical classification. Insight into functions and structure stabilization. Basis for homology and threading. Manual classification: SCOP [Murzin et al., 1995].

26 Application #2: Classify Proteins: Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997]. Hierarchical classification. Insight into functions and structure stabilization. Basis for homology and threading. Manual classification: SCOP [Murzin et al., 1995]. Increasing size of PDB. Automatic classifiers: CATH [Orengo et al., 1997]; Pclass [Singh et al.]; FSSP [Holm and Sander]. Levels: Class: similar secondary structure content. Fold: SSEs in similar arrangement. Family: clear evolutionary relationship.

27 Manual vs. Automatic Classification

28 Application #3: Find Motif in Protein Structure: Given a protein structure and a motif (e.g., a small collection of atoms corresponding to a binding site), find whether the motif matches a substructure of the protein. Variant: one motif against many proteins. [Figure: active sites of 1PIP and 5PAD. Only 3 amino acids participate in the motif.]

29 Application #4: Find Pharmacophore: Given: a small collection (5-10) of small flexible ligands with similar activity (hence assumed to bind at the same protein site), and low-energy conformations (several dozen to a few hundred) for each ligand. Find a substructure (pharmacophore) that occurs in at least one conformation of each ligand. Key problem in drug design when the binding site is unknown.

30 Application #4: Find Pharmacophore: [Figure: 1TLP, 4TMN, 5TMN, 6TMN, inhibitors of thermolysin; clusters of low-energy conformations of 1TLP; the 4 ligands overlapped with their pharmacophore matched.]

31 Application #5: Search for Ligands Containing a Pharmacophore: Given: a database containing several 100,000, or more, small ligands, and a pharmacophore P. Find all ligands that have a low-energy conformation containing P. Data mining of pharmaceutical databases (lead generation). Reference: S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C. Latombe. A Randomized Kinematics-Based Approach to Pharmacophore-Constrained Conformational Search and Database Screening. J. of Computational Chemistry, 21(9):731-747, July 2000.

32 Outline: Definitions, Applications, Methods, Issues

33 Multiple Partial Matches

34 Distributed Support: [Figure labels: A, B, gap, σ(A), σ(B).]

35 What is Best? [Figure: A and B.] Should gaps be penalized?

36 What About This? [Figure: A and B.] Sequence along the backbone is not preserved.

37 The similarity measure is unlikely to satisfy the triangle inequality for partial matches.

38 Compute Structure Similarity: Structure representation. Similarity measurement. Computational solution.

39 Structure representation: Element-based representation: a structure is broken down into a list of structure elements. We represent a protein structure by its geometry, topology, and attributes. Geometry: the coordinates of the elements. Topology: the physical and chemical interactions of the elements. Attributes: the physical and chemical attributes of the elements.

40 Structure Representation: There are three major groups of structure representation. Point list: treat the protein as a list of points in 3D space. Point set: treat the protein as a set of points in 3D space. Graphs: treat the protein as a graph.

41 Comparing two point sets: Similarity measure: Given two point sets $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_m\}$ with $n \le m$, find a Euclidean transformation $T$ (rotation + translation) and a 1-1 mapping $f$ from $P$ to $Q$ such that $S(P, Q) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d^2(p_i, T(f(p_i)))}$ is minimized. $S$ is called the RMSD (root-mean-square deviation) between the two structures; the factor $1/n$ supplies the mean and does not change the minimizer.
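To make the measure concrete, here is a minimal Python sketch of the RMSD for the case where the mapping f and the transform T are already known. The function name `rmsd` and the sample coordinates are illustrative assumptions, not part of the lecture.

```python
import math

def rmsd(P, Q_mapped):
    """Root-mean-square deviation between paired points.

    P and Q_mapped are equal-length lists of coordinate tuples, where
    Q_mapped[i] stands for T(f(p_i)): the partner of P[i] after the
    rigid transform has already been applied.
    """
    if len(P) != len(Q_mapped) or not P:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(P, Q_mapped))
    return math.sqrt(total / len(P))

# Hypothetical data: Q is P shifted by one unit along z.
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
Q = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(rmsd(P, Q))
```

The hard part of the problem on the slide is choosing f and T; this function only scores one candidate choice.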

42 Comparing two point sets: If m = n, there is a closed-form solution that solves the problem of comparing the two point sets exactly. If m ≠ n, the problem is much harder.
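The slide does not say which closed-form solution is meant. As an illustration only, the sketch below assumes the correspondence is already fixed (P[i] pairs with Q[i]) and works in 2D, where the least-squares rotation angle has a simple formula after both sets are centered on their centroids. The 3D case is usually handled with an SVD- or quaternion-based method, which is not shown here. All names are hypothetical.

```python
import math

def transform(points, theta, tx, ty):
    """Rotate 2D points by theta (radians) about the origin, then translate."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def superpose_2d(P, Q):
    """Closed-form least-squares fit: returns (theta, tx, ty) such that
    transform(Q, theta, tx, ty) is as close as possible to P, where
    P[i] is paired with Q[i]."""
    n = len(P)
    pcx, pcy = sum(x for x, _ in P) / n, sum(y for _, y in P) / n
    qcx, qcy = sum(x for x, _ in Q) / n, sum(y for _, y in Q) / n
    dot_sum = cross_sum = 0.0
    for (px, py), (qx, qy) in zip(P, Q):
        px, py, qx, qy = px - pcx, py - pcy, qx - qcx, qy - qcy
        dot_sum += qx * px + qy * py       # cosine coefficient
        cross_sum += qx * py - qy * px     # sine coefficient
    theta = math.atan2(cross_sum, dot_sum)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated centroid of Q onto the centroid of P.
    tx = pcx - (c * qcx - s * qcy)
    ty = pcy - (s * qcx + c * qcy)
    return theta, tx, ty

P = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
Q = transform(P, -0.5, 3.0, -1.0)   # a rotated and shifted copy of P
theta, tx, ty = superpose_2d(P, Q)
fitted = transform(Q, theta, tx, ty)
print(round(theta, 6))
```

Because Q is an exact moved copy of P, the fitted points land back on P; with noisy data the same formula returns the least-squares optimum for the given pairing.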

43 Common Point Subset Problem: Find the largest common point subset. Given two point sets $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_m\}$ with $n \le m$, find a Euclidean transformation $T$ (rotation + translation) and a 1-1 partial mapping $f$ from $P$ to $Q$ with maximal cardinality such that $d(p_i, T(f(p_i))) < t$ for all $i$ on which $f$ is defined. Also a harder problem (but not an NP-hard problem).
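For intuition about the threshold condition, the following Python sketch builds a 1-1 partial mapping for one fixed, already applied transform by accepting the closest pairs first. This is a greedy heuristic chosen for brevity: it is not guaranteed to reach maximal cardinality and it does not search over transforms, so it is not the method of the lecture. The function name and the data are made up.

```python
import math

def greedy_partial_match(P, Q_transformed, t):
    """Return {i: j} pairing P[i] with Q_transformed[j], each index used at
    most once, with every paired distance strictly below the threshold t."""
    candidates = sorted(
        (math.dist(p, q), i, j)
        for i, p in enumerate(P)
        for j, q in enumerate(Q_transformed)
    )
    used_p, used_q, mapping = set(), set(), {}
    for d, i, j in candidates:
        if d >= t:
            break                      # all remaining pairs are too far apart
        if i not in used_p and j not in used_q:
            mapping[i] = j
            used_p.add(i)
            used_q.add(j)
    return mapping

P = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
Q = [(0.1, 0.0), (1.0, 0.1), (9.0, 9.0), (2.0, 2.0)]
mapping = greedy_partial_match(P, Q, t=0.5)
print(mapping)
```

Here two of the three points of P find a partner within the threshold, so the support of this partial match has size 2.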

44 Geometric Hashing: Originally used for automatic visual recognition of geometric figures. The principle: We have two geometric figures, a model A with m points (there can be several models) and a query B with n points. Discover similar subfigures in A and B, invariant under placement, rotation (and often size). Let the figures be described by points. Try to find the largest set of point pairs from (A, B) that coincide.

45 Coinciding points: Example in 2 dimensions. Find six overlapping pairs: (1,a), (2,d), (3,c), (4,e), (6,f), (7,g). The coinciding pairs are independent of the labeling. Note that the figures can be translated and rotated.

46 Reference frames: The points of the figures are specified in coordinate systems, or reference frames. In 2D, a reference frame can be defined by two points. Choose two points from A, (a_i, a_k), and two from B, (b_j, b_l), called bases, and define the reference frames (RF) from the bases. Example: origin at a_i and the x-axis along the line a_i, a_k; or origin at the midpoint of a_i, a_k. Find the positions in the RF of all the other points; the result is called the reference frame system (RFS). "Overlap" the x,y-axes of RFS A and RFS B, and count the number of coinciding points.
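The first construction on this slide (origin at a_i, x-axis along a_i to a_k) can be sketched in Python as follows. The helper names `to_frame` and `rigid_move` and the three-point figure are illustrative assumptions. The point of the example is that the frame coordinates do not change when the whole figure is rotated and translated.

```python
import math

def to_frame(points, origin, x_target):
    """Coordinates of 2D `points` in the frame whose origin is `origin`
    and whose x-axis points from `origin` toward `x_target`."""
    dx, dy = x_target[0] - origin[0], x_target[1] - origin[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("the two basis points must be distinct")
    ux, uy = dx / length, dy / length          # unit vector of the x-axis
    result = []
    for x, y in points:
        rx, ry = x - origin[0], y - origin[1]
        # Project onto the x-axis and onto the axis 90 degrees counterclockwise.
        result.append((rx * ux + ry * uy, -rx * uy + ry * ux))
    return result

def rigid_move(points, theta, tx, ty):
    """Rotate by theta (radians), then translate: used only to test invariance."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

figure = [(1, 1), (1, 3), (0, 1)]
frame = to_frame(figure, figure[0], figure[1])

moved = rigid_move(figure, 0.7, 5.0, -2.0)
frame_moved = to_frame(moved, moved[0], moved[1])
print(frame)
```

Up to floating-point rounding, `frame` and `frame_moved` agree, which is what makes a reference frame system usable for comparing figures in arbitrary placements.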

47 Reference frame system, example: Model (1,3): [(0,0) (6,2) (8,0) (9,4) (6,10) (3,8) (-1,6)]. Query (a,c): [(0,0) (3,-2) (8,0) (6,2) (10,4) (3,8) (0,6)]. Model (3,5): [(0,0) (1,8) (2,2) (4,-2) (10,0) (8,3) (8,7)]. Model (1,3) and query (a,c) have four coinciding points; for model (3,5) and query (a,c), only the origins coincide.

48 Comparison of (Reference) Frame Systems: The number of coinciding points depends on the bases, so all possible pairs should be tried as bases. This would result in m(m-1)n(n-1) comparisons of reference frame systems, but many of those comparisons are redundant. Geometric hashing is used to perform many comparisons efficiently and "simultaneously".
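A quick way to see where the count m(m-1)n(n-1) comes from is to enumerate ordered pairs of distinct points explicitly. The Python sketch below does this for m = n = 7, the size of the 2D example figures; the function name is an assumption made for illustration.

```python
from itertools import permutations

def ordered_bases(num_points):
    """All ordered pairs (u, v) of distinct point indices: the candidate bases."""
    return list(permutations(range(num_points), 2))

m = n = 7                                # sizes of the 2D example figures
model_bases = ordered_bases(m)           # m * (m - 1) bases
query_bases = ordered_bases(n)           # n * (n - 1) bases
total = len(model_bases) * len(query_bases)
print(len(model_bases), len(query_bases), total)
```

Even for two seven-point figures, the brute-force approach compares 42 × 42 = 1764 pairs of frame systems, which motivates sharing work through a hash table.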

49 Hashing: Compare a query frame system to all model frame systems simultaneously. Assume a 2D hash table H and a simple hash function: one bucket for each square of the frame system, identified by (p,q). Let (u,v) ∈ H(p,q) mean that the frame system with basis (u,v) has a point in the square (p,q) (a very simple hash function). H is filled in a preprocessing step on the model.
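The preprocessing step can be sketched in Python as below. This is a minimal illustration under stated assumptions: unit grid squares centered on integer coordinates, every ordered pair of model points used as a basis, and no rescaling of the frame. The names `to_frame`, `square`, and `build_table` and the three-point model are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations

def to_frame(points, origin, x_target):
    """Express 2D points in the frame with origin at `origin` and
    x-axis pointing from `origin` toward `x_target` (no rescaling)."""
    dx, dy = x_target[0] - origin[0], x_target[1] - origin[1]
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    return [((x - origin[0]) * ux + (y - origin[1]) * uy,
             -(x - origin[0]) * uy + (y - origin[1]) * ux)
            for x, y in points]

def square(point, size=1.0):
    """Index (p, q) of the grid square whose centre is nearest to `point`."""
    return (math.floor(point[0] / size + 0.5),
            math.floor(point[1] / size + 0.5))

def build_table(points, size=1.0):
    """Preprocessing: H[(p, q)] is the set of bases (u, v) whose frame
    system has a point in square (p, q)."""
    H = defaultdict(set)
    for u, v in permutations(range(len(points)), 2):
        for local in to_frame(points, points[u], points[v]):
            H[square(local, size)].add((u, v))
    return H

model = [(0, 0), (2, 0), (0, 1)]
H = build_table(model)
print(len(H[(0, 0)]))      # every basis places its own origin in square (0, 0)
print(sorted(H[(2, 0)]))   # bases whose frame system has a point near (2, 0)
```

The square size trades tolerance for selectivity: larger squares accept noisier coordinates but put more unrelated bases into the same bucket.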

50 Hashing preprocessing example

51 Recognition: Compare the query with the model (several frame systems). (1) Select a basis in the query and define the reference frame. (2) Find the positions of all the other points in the reference frame. (3) For each point r in the query reference system: calculate the position (x,y) in H that r hashes to, and cast one vote for each model reference system in H(x,y). (4) Recognize the model reference systems with the highest votes. (5) Repeat for more query reference systems if not enough coinciding points are found.

52 Example recognition: query (a,c): [(0,0) (3,-2) (8,0) (6,2) (10,4) (3,8) (0,6)]
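The voting step can be traced in Python with the frame systems listed in the transcript for model bases (1,3) and (3,5) and for query basis (a,c). This sketch assumes the hash key is simply the integer coordinate pair, which suffices here because all listed coordinates are integers; the function names are illustrative.

```python
from collections import Counter, defaultdict

# Frame systems as listed in the transcript (coordinates in each frame).
model_frames = {
    (1, 3): [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)],
    (3, 5): [(0, 0), (1, 8), (2, 2), (4, -2), (10, 0), (8, 3), (8, 7)],
}
query_ac = [(0, 0), (3, -2), (8, 0), (6, 2), (10, 4), (3, 8), (0, 6)]

def build_table(frames):
    """Preprocessing: map each occupied square to the model bases using it."""
    table = defaultdict(list)
    for basis, points in frames.items():
        for point in points:
            table[point].append(basis)
    return table

def vote(table, query_points):
    """Each query point casts one vote for every model basis in its square."""
    votes = Counter()
    for point in query_points:
        for basis in table.get(point, ()):
            votes[basis] += 1
    return votes

votes = vote(build_table(model_frames), query_ac)
print(votes.most_common())
```

Model basis (1,3) collects four votes and model basis (3,5) collects one, from the shared origin. A single pass over the query points scores every stored frame system at once, which is the sense in which the comparisons happen "simultaneously".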

53 Use of several models: Several models can be stored in the same hash table. Each entry in the hash table must then identify both the model and the reference system. Example: a database of structures, stored in a hash table.

54 Geometric hashing for structure comparison: Need methods invariant under translation and rotation. Use geometric hashing to find subsets of coincident residues (points), that is, residues that superpose well. 1. Define reference frames. Three atoms can be used: a_i, a_k, a_r. Example: origin at a_i; the x-axis along a_i, a_k; the y-axis in the plane defined by a_i, a_k, a_r, counterclockwise from the x-axis; the z-axis orthogonal to the plane. 2. The residues may have labels (attributes). (a) Implement the labels explicitly, in the hash table. (b) Implement the labels implicitly, in the hashing.
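The three-atom frame from step 1 can be sketched in Python with plain vector arithmetic. The helper names and the atom coordinates are assumptions for illustration; the construction takes the x-axis from a_i to a_k, the z-axis normal to the plane of the three atoms, and the y-axis in that plane on the same side as a_r.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    if length == 0:
        raise ValueError("atoms must be distinct and not collinear")
    return (v[0] / length, v[1] / length, v[2] / length)

def frame_from_atoms(a_i, a_k, a_r):
    """Origin at a_i, x along a_i -> a_k, z normal to the plane of the
    three atoms, y in that plane (right-handed frame)."""
    x = normalize(sub(a_k, a_i))
    z = normalize(cross(x, sub(a_r, a_i)))
    y = cross(z, x)
    return a_i, (x, y, z)

def to_local(point, origin, axes):
    """Coordinates of `point` in the frame (origin, axes)."""
    r = sub(point, origin)
    return tuple(dot(r, axis) for axis in axes)

# Hypothetical atom positions.
a_i, a_k, a_r = (1.0, 1.0, 1.0), (3.0, 1.0, 1.0), (1.0, 4.0, 1.0)
origin, axes = frame_from_atoms(a_i, a_k, a_r)
print(to_local(a_r, origin, axes))               # lies in the xy-plane, y > 0
print(to_local((1.0, 1.0, 3.0), origin, axes))   # directly above the origin
```

Residue coordinates expressed this way can then be quantized into 3D grid cells and stored in a hash table in the same manner as the 2D squares.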

