1 Randomized Algorithms for Three Dimensional Protein Structures Comparison Yaw-Ling Lin Dept Computer Sci and Info Engineering, Providence University,

Presentation transcript:

1 Randomized Algorithms for Three Dimensional Protein Structures Comparison. Yaw-Ling Lin, Dept. of Computer Science and Information Engineering, Providence University, Taiwan

2 Outline: Introduction; Protein Structures; 3D structure comparisons; Algorithms; Benchmarking; Comparing with other systems; Future Works

3 Introduction

4 What are proteins? Structural framework (keratin, collagen); transport and storage of small molecules (hemoglobin); transmission of information (hormones, receptors); antibodies; blood clotting factors; enzymes. A protein is created in the cell as a unique sequence of amino acids, e.g. A C L E V M L C V.

5 Sequence (ACMVLLCEVEKYP…) → folding → Structure → Function?

6 Background and Problem Definition. The function of 40-50% of the new proteins is unknown. The number of known protein sequences (non-redundant database) keeps growing rapidly (large-scale sequencing projects). Understanding biological function is important for: the study of fundamental biological processes; drug design; genetic engineering.

7 What can bioinformatics do for us?

8 Drug Discovery. Target identification: which protein to inhibit? Lead discovery and optimization: what sort of molecule will bind to this protein? Toxicology: side effects, target specificity. Pharmacokinetics: metabolization and transport.

9 Drug Development Life Cycle. Discovery (2 to 10 years); Preclinical Testing (lab and animal testing); Phase I (20-30 healthy volunteers, used to check for safety and dosage); Phase II (patient volunteers, used to check for efficacy and side effects); Phase III (patient volunteers, used to monitor reactions to long-term drug use); FDA Review & Approval; Post-Marketing Testing. $ Million! 7 – 15 Years! With the aid of bioinformatics

10 Drug lead screening: 5,000 to 10,000 compounds screened; 250 lead candidates in preclinical testing; 5 drug candidates enter clinical testing; 80% pass Phase I; 30% pass Phase II; 80% pass Phase III; one drug approved by the FDA.
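
The clinical-phase numbers on this slide can be checked with a few lines of arithmetic. The sketch below simply multiplies the 5 clinical candidates by the three pass rates quoted above; the function and variable names are illustrative and not part of the original slides.

```python
# Pass rates quoted on the slide for the three clinical phases.
CANDIDATES_ENTERING_CLINICAL = 5
PASS_RATES = {"Phase I": 0.80, "Phase II": 0.30, "Phase III": 0.80}


def expected_survivors(start, rates):
    """Multiply the starting count by each phase's pass rate in turn."""
    remaining = float(start)
    for rate in rates.values():
        remaining *= rate
    return remaining


approved = expected_survivors(CANDIDATES_ENTERING_CLINICAL, PASS_RATES)
print(f"Expected approvals: {approved:.2f}")
```

The product is just under one, which agrees with the slide's "one drug approved" out of the original 5,000 to 10,000 compounds.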

11 Drug Lead Screening & Docking. Complementarity: shape, chemical, electrostatic.

12 Protein Structures

13 Levels of structure in proteins

14 Myoglobin structure

15 Myoglobin structure contd.

16 Myoglobin in solution

17 Three dimensional structures of cytochrome c, lysozyme and ribonuclease

18 PDB file format

19 PDB file format

20 PDB file format

21 PDB file format
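
The slides on the PDB format survive only as titles, so the following is a minimal sketch of how C-alpha coordinates might be read from the fixed-column ATOM records of a PDB file. The function name `parse_ca_atoms` is hypothetical; the column positions follow the standard PDB ATOM record layout (atom name in columns 13-16, chain in column 22, residue number in columns 23-26, x/y/z in columns 31-54).

```python
def parse_ca_atoms(pdb_lines):
    """Return [(chain_id, residue_number, (x, y, z)), ...] for C-alpha atoms.

    Only ATOM records are read; HETATM records (ligands, ions, water) are
    skipped, so a calcium ion named CA is not mistaken for a C-alpha atom.
    """
    atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":      # atom name, columns 13-16
            continue
        chain_id = line[21]                  # column 22
        residue_number = int(line[22:26])    # columns 23-26
        xyz = (float(line[30:38]),           # x, columns 31-38
               float(line[38:46]),           # y, columns 39-46
               float(line[46:54]))           # z, columns 47-54
        atoms.append((chain_id, residue_number, xyz))
    return atoms
```

Typical use would be `parse_ca_atoms(open("101m.pdb"))`, where the file name is only an example; the resulting coordinate list is the kind of input the comparison methods in the following slides work on.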

22 Protein Structures

23 Rasmol-Structure PDB: 101M PDB: 2DHB

24 Rasmol-Group PDB: 101M PDB: 2DHB

25 Structural classifications: SCOP, CATH, FSSP. Structure comparison algorithms: Dali, CE, Structal, VAST.

26 Contact matrix and the Dali method Idea: Similar structures have similar contact matrices

27 From distance map to structural similarities. Imagine a transparent distance map of one protein put on top of the map of another protein (Liisa Holm and Chris Sander, J. Mol. Biol.): matching patches centered on the diagonal correspond to matching secondary structures; matches of short distances off the diagonal correspond to tertiary conformations. Similarity score: unmatched residues do not contribute to the score.
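
To make the idea of a distance map concrete, here is a minimal sketch that builds the matrix of pairwise C-alpha distances and thresholds it into a contact matrix. The function names and the 8 Å cutoff are assumptions chosen for illustration; the slides do not state a cutoff.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points (e.g. C-alpha atoms)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]


def contact_matrix(dist, cutoff=8.0):
    """1 where two residues are within `cutoff` angstroms, else 0.

    The 8.0 default is an illustrative assumption, not a value from the slides.
    """
    return [[1 if d <= cutoff else 0 for d in row] for row in dist]
```

Because both matrices depend only on distances inside one protein, they are unchanged by rotating or translating the structure, which is why two maps can be compared without superimposing the proteins first.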

28 Contact matrix and the Dali method Idea: Similar structures have similar contact matrices

29 DALI algorithm outline. Step 1: Consider all possible pairs of 6x6 submatrices of the contact matrices. Such matrices are small enough that the problem can be solved optimally. Step 2: Assemble the alignments from step 1. Method: Monte Carlo algorithm.
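
Step 1 can be sketched as an exhaustive comparison of 6x6 blocks cut from the two proteins' distance matrices. This is only an illustration under stated assumptions: the mean absolute difference and the 1.0 Å tolerance stand in for Dali's actual similarity score, which the slides do not give, and all function names are hypothetical.

```python
def submatrix(dist, row, col, size=6):
    """Cut the size x size block of `dist` whose top-left corner is (row, col)."""
    return [r[col:col + size] for r in dist[row:row + size]]


def mean_abs_difference(block_a, block_b):
    """Average absolute difference between corresponding cells of two blocks."""
    cells = [abs(a - b)
             for row_a, row_b in zip(block_a, block_b)
             for a, b in zip(row_a, row_b)]
    return sum(cells) / len(cells)


def similar_block_pairs(dist_a, dist_b, size=6, tolerance=1.0):
    """List ((row_a, col_a), (row_b, col_b)) for blocks that look alike.

    Assumed scoring: blocks match when their mean absolute difference is at
    most `tolerance`. Every block of A is tried against every block of B.
    """
    pairs = []
    starts_a = range(len(dist_a) - size + 1)
    starts_b = range(len(dist_b) - size + 1)
    for ia in starts_a:
        for ja in starts_a:
            block_a = submatrix(dist_a, ia, ja, size)
            for ib in starts_b:
                for jb in starts_b:
                    block_b = submatrix(dist_b, ib, jb, size)
                    if mean_abs_difference(block_a, block_b) <= tolerance:
                        pairs.append(((ia, ja), (ib, jb)))
    return pairs
```

The matching pairs found this way are the raw material for step 2, where they are assembled into a full alignment; that Monte Carlo assembly is not shown here.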

30 CE (Shindyalov & Bourne, Protein Eng. 1998): Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path. Define an alignment fragment pair (AFP) as a continuous segment of protein A aligned against a continuous segment of protein B (without gaps). An alignment is a path of AFPs such that between any two consecutive AFPs gaps may be inserted into either A or B, but not into both. That is, for every two consecutive AFPs i and i+1, either $p^A_{i+1} = p^A_i + m$ and $p^B_{i+1} = p^B_i + m$, or $p^A_{i+1} > p^A_i + m$ and $p^B_{i+1} = p^B_i + m$, or $p^A_{i+1} = p^A_i + m$ and $p^B_{i+1} > p^B_i + m$, where $p^A_i$ is the starting position of AFP i in protein A, $p^B_i$ is its starting position in protein B, and m is the AFP length.
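
The path rule (a gap on at most one side between consecutive AFPs) can be written as a short check. This is a sketch under the assumption that every AFP has the same length `m` and is stored as a pair of start positions; the function name is hypothetical.

```python
def is_valid_path(afps, m):
    """Check the CE path rule for a list of AFPs.

    afps: list of (start_in_A, start_in_B) tuples, in path order.
    m:    AFP length (assumed equal for all AFPs).
    Between consecutive AFPs, residues may be skipped in A or in B,
    but never in both at once.
    """
    for (a0, b0), (a1, b1) in zip(afps, afps[1:]):
        no_skip = a1 == a0 + m and b1 == b0 + m
        skip_in_a = a1 > a0 + m and b1 == b0 + m
        skip_in_b = a1 == a0 + m and b1 > b0 + m
        if not (no_skip or skip_in_a or skip_in_b):
            return False
    return True
```

For example, with `m = 8` the path `[(0, 0), (8, 8), (20, 16)]` is valid (four residues of A are skipped before the third AFP), while `[(0, 0), (10, 10)]` is not, because both proteins skip residues.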

31 CE: What is a "good" AFP? Define the distance between two different AFPs i and j (formula not preserved in this transcript), where $d^A(p,q)$ is the distance between the alpha carbon atoms at positions p and q in protein A. If you already have n-1 AFPs and consider adding the n-th AFP, do so only if the distance conditions hold (conditions not preserved in this transcript).
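
The formula on this slide did not survive in the transcript. As a hedged illustration, the sketch below implements one of the distance measures described in the CE paper, the one based on the full set of inter-residue distances: the average of $|d^A(p^A_i + k, p^A_j + l) - d^B(p^B_i + k, p^B_j + l)|$ over all k, l from 0 to m-1. Whether the slide showed this variant is an assumption, and the function name is hypothetical.

```python
def afp_distance(dist_a, dist_b, afp_i, afp_j, m):
    """Full-set distance between two AFPs (assumed variant, see lead-in).

    dist_a, dist_b: C-alpha distance matrices of proteins A and B.
    afp_i, afp_j:   (start_in_A, start_in_B) tuples.
    m:              AFP length.
    """
    (a_i, b_i), (a_j, b_j) = afp_i, afp_j
    total = 0.0
    for k in range(m):
        for l in range(m):
            total += abs(dist_a[a_i + k][a_j + l] - dist_b[b_i + k][b_j + l])
    return total / (m * m)
```

A small value means the two fragments sit in the same relative arrangement in both proteins, which is what makes a candidate AFP "good" to add to the path.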

32 CE (cont.) 1. Select an initial AFP. 2. Build an alignment path by incrementally adding "good" AFPs that satisfy the conditions of paths. 3. Repeat step (2) until the proteins are completely matched, or until no good AFPs remain. 4. To assess the significance of the alignment, compare it to the alignment of random pairs of structures, and compute the Z-score based on the RMSD and number of gaps in the final alignment.

33 Structal (Levitt & Gerstein, PNAS 1998). An initial equivalence is chosen, based on matching the ends of the two structures. Repeat until convergence: (1) superimpose the two structures so as to minimize the RMS, given the equivalence; (2) given the superposition, calculate the distances $d_{ij}$ between any atom i in the first protein and any atom j in the second protein; (3) transform distances into similarities $s_{ij} = M / [1 + (d_{ij}/d_0)^2]$, where $M = 20$ and $d_0 = 2.24$ Å; (4) apply dynamic programming to define a new set of equivalences.

34 Structal (cont.) 1) Alignment fixed. 2) Superimpose to minimize RMS. 3) Calculate distances between all atoms. 4) Use dynamic programming to find the best set of equivalences. 5) Superimpose given the new alignment. 6) Recalculate distances between all atoms.
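
The distance-to-similarity transform and the dynamic programming step can be sketched as follows. The similarity uses the slide's values M = 20 and d0 = 2.24 Å. The alignment is a plain global dynamic program with a linear gap penalty of 10; that gap scheme and the function names are assumptions for illustration, and the superposition step is not shown.

```python
def structal_similarity(d, M=20.0, d0=2.24):
    """s = M / (1 + (d / d0)^2), with M and d0 as given on the slide."""
    return M / (1.0 + (d / d0) ** 2)


def align_by_dp(sim, gap=10.0):
    """Global alignment over a similarity matrix (assumed linear gap penalty).

    Returns (best_score, [(i, j), ...]) where (i, j) are matched positions.
    """
    n, m = len(sim), len(sim[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -gap * i
    for j in range(1, m + 1):
        best[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + sim[i - 1][j - 1],
                             best[i - 1][j] - gap,
                             best[i][j - 1] - gap)
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

In the iterative scheme, `sim` would be rebuilt from the current superposition by applying `structal_similarity` to every inter-protein distance, and the returned pairs would define the equivalences for the next superposition.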

35 Approach based on comparing secondary structure arrangement. Motivation: folds are often defined as arrangements of secondary structure elements (SSEs). Why not compare the arrangement of SSEs rather than going down to the atomic level? 1EJ9: Human topoisomerase

36 VAST: graph-theoretical approach. Perform the comparison at the level of secondary structures rather than residues. Treat each secondary structure as a vector whose direction and length correspond to the direction and length of the secondary structure. Attributes of such a vector include the type of secondary structure, number of residues, etc. For each pair of secondary structures, provide a way of describing their relative spatial position: distance, angle, etc. VAST finds the maximal subset of secondary structures that are in the same relative positions in the compared protein structures and in the same order within the structure.
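
The vector description of secondary structures can be illustrated with a small sketch. Each element is represented by its start and end points, and two elements are related by the angle between their vectors and the distance between their midpoints. The representation and the function names are assumptions; VAST's actual descriptors are not detailed in the slides.

```python
import math


def sse_vector(start, end):
    """Direction vector of a secondary structure element from start to end."""
    return tuple(e - s for s, e in zip(start, end))


def angle_between(u, v):
    """Angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def midpoint_distance(sse_a, sse_b):
    """Distance between the midpoints of two (start, end) elements."""
    def midpoint(sse):
        start, end = sse
        return tuple((s + e) / 2.0 for s, e in zip(start, end))
    return math.dist(midpoint(sse_a), midpoint(sse_b))
```

Two proteins can then be compared by looking for pairs of elements whose types, angles and distances agree, which is a much smaller search problem than matching individual residues.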

37

38

39

40

41

42 SCOP. Structural classification of proteins with a 5-level hierarchy. Domains: the individual entries. Family: homologous proteins with significant sequence similarity. Superfamily: protein families that share weak sequence similarity but have conserved functional residues (e.g. in active sites), believed to be evolutionarily related. Fold: protein superfamilies that share the same fold (not necessarily due to common evolutionary ancestry). Class: all-alpha, all-beta, alpha/beta, alpha+beta, membrane proteins, small proteins. The classification is based on manual analysis by experts (Dr. Alexey Murzin). As of May 2002: 7 main classes, 686 folds, 1073 superfamilies, 1827 families.

43 CATH. Structural classification of proteins with a 5-level hierarchy. Protein chains: the individual entries. Homologous superfamily: proteins with highly similar structures and functions. Topology: clusters according to the topological connections and numbers of secondary structures. Architecture: describes the gross orientation of secondary structures, independent of connectivities (assigned manually). Class: derived from secondary structure content; assigned automatically for more than 90% of protein structures. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons. As of Jan 2002: 8 main classes, 46 architectures, 1453 topologies, more than 2000 superfamilies.

44 FSSP. Structural classification of proteins into a tree hierarchy. Protein domains: the individual entries (defined using the algorithm of Holm and Sander 1994). Start with an all-vs-all structure comparison of protein domains. Domains are then clustered automatically using the single linkage algorithm, based on the z-scores of the structure similarity scores. 3242 families of more than 30,000 structures as of June 2002.
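
Single-linkage clustering at one similarity level can be sketched with a union-find structure: two domains end up in the same cluster whenever a chain of pairwise z-scores at or above the cutoff connects them. The cutoff of 2.0 and the function name are illustrative assumptions, not values taken from FSSP.

```python
def single_linkage_clusters(n_items, z_scores, z_cutoff=2.0):
    """Cluster items 0..n_items-1 by single linkage on pairwise z-scores.

    z_scores: dict mapping (i, j) to the z-score of that pair.
    z_cutoff: assumed threshold; pairs at or above it are linked.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), z in z_scores.items():
        if z >= z_cutoff:
            parent[find(i)] = find(j)

    clusters = {}
    for item in range(n_items):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())
```

Repeating the clustering at successively lower cutoffs merges clusters step by step, which yields the kind of tree hierarchy the slide describes.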

45 Algorithms Measurement: rmsd. Pair atoms of two structures by minimum bipartite matching. Fix one structure, and keep several 3-D orientations of the other. Randomly perturb these orientations, and shift to better positions until converging. Report the best rmsd score and orientation.
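
The scoring step (pair the atoms, then measure rmsd) can be sketched for very small inputs as below. The pairing here is found by trying every permutation, which is a correct minimum-cost perfect matching but exponential in the number of atoms; a real implementation would use a polynomial-time bipartite matching algorithm (slide 64 mentions LEDA). Using squared distance as the matching cost, equal-sized point sets, and the function names are all assumptions.

```python
import itertools
import math


def rmsd(points_a, points_b):
    """Root-mean-square deviation between two equally long, paired point lists."""
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(points_a, points_b))
    return math.sqrt(squared / len(points_a))


def best_pairing(points_a, points_b):
    """Minimum-cost perfect matching by exhaustive search (tiny inputs only).

    Returns (perm, score): points_a[i] is paired with points_b[perm[i]],
    and score is the rmsd of that pairing. Cost is squared distance (assumed).
    """
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(len(points_b))):
        cost = sum(math.dist(points_a[i], points_b[j]) ** 2
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), math.sqrt(best_cost / len(points_a))
```

In the randomized search, this score would be evaluated for each candidate orientation of the second structure, and the orientation with the lowest rmsd would be kept.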

46 INIT-S(N), shown for N=4, N=6, N=8, N=12 and N=20

47 INIT-S(N)

48 MB-Align Algorithm

49 MB-Align Descriptions

50 3D Transformation. 3D rotation is done around a rotation axis. Fundamental rotations: about the x, y, or z axes. Positive rotation: counter-clockwise rotation when you look along the axis in its negative direction, that is, from the positive end toward the origin.

51 3D Transformation: Rotation about Z. $x' = x\cos\theta - y\sin\theta$, $y' = x\sin\theta + y\cos\theta$, $z' = z$. In matrix form: $R_z(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$. OpenGL: glRotatef(θ, 0,0,1)

52 3D Transformation: Rotation about Y (z → x, x → y, y → z). $z' = z\cos\theta - x\sin\theta$, $x' = z\sin\theta + x\cos\theta$, $y' = y$. In matrix form: $R_y(\theta) = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix}$. OpenGL: glRotatef(θ, 0,1,0)

53 3D Transformation: Rotation about X (y → x, z → y, x → z). $y' = y\cos\theta - z\sin\theta$, $z' = y\sin\theta + z\cos\theta$, $x' = x$. In matrix form: $R_x(\theta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{pmatrix}$. OpenGL: glRotatef(θ, 1,0,0)
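
The three fundamental rotations translate directly into code. The sketch below applies the formulas from the slides to a single point; the function names are illustrative, and angles are in radians here, whereas OpenGL's glRotatef takes degrees.

```python
import math


def rotate_z(point, theta):
    """Rotate a point about the z axis by theta radians."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (x * c - y * s, x * s + y * c, z)


def rotate_y(point, theta):
    """Rotate a point about the y axis by theta radians."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (z * s + x * c, y, z * c - x * s)


def rotate_x(point, theta):
    """Rotate a point about the x axis by theta radians."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (x, y * c - z * s, y * s + z * c)
```

A quarter turn makes the sign convention easy to check: rotating about z carries the x axis onto the y axis, rotating about y carries z onto x, and rotating about x carries y onto z.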

54 3D Transformation: arbitrary rotation axis (rx, ry, rz). glRotatef(angle, rx, ry, rz). So, which way is a positive rotation?
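
Rotation about an arbitrary axis can be sketched with Rodrigues' rotation formula, $v' = v\cos\theta + (k \times v)\sin\theta + k\,(k \cdot v)(1 - \cos\theta)$, where k is the unit vector along the axis. The function name is hypothetical, the angle is in radians (glRotatef takes degrees), and the formula follows the right-hand rule: a positive angle is counter-clockwise when the axis points toward the viewer.

```python
import math


def rotate_about_axis(point, axis, theta):
    """Rotate `point` by theta radians about the line through the origin
    with direction `axis` (any nonzero vector), using Rodrigues' formula."""
    norm = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    dot = kx * x + ky * y + kz * z
    cross = (ky * z - kz * y, kz * x - kx * z, kx * y - ky * x)
    return tuple(v * c + cr * s + k * dot * (1.0 - c)
                 for v, cr, k in zip((x, y, z), cross, (kx, ky, kz)))
```

With the axis set to (0, 0, 1), (0, 1, 0) or (1, 0, 0) this reduces to the rotations about z, y and x from the preceding slides.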

55 Rotation

56 Rotation

57 Rotation

58 Rotation

59 Rotation Matrix

60 Perturbation: the orientation vector is perturbed to its neighborhood.
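
One simple way to move an orientation vector to a nearby one is to add a small random offset and renormalize. This is only a sketch of the general idea: the Gaussian noise, the `scale` parameter and the function name are assumptions, and the perturbation actually used in the talk (slide 62) is not preserved in the transcript.

```python
import math
import random


def perturb_direction(vector, scale, rng=random):
    """Return a unit vector close to `vector`.

    Adds independent Gaussian noise with standard deviation `scale`
    to each component (assumed scheme), then rescales to unit length.
    """
    moved = [c + rng.gauss(0.0, scale) for c in vector]
    norm = math.sqrt(sum(c * c for c in moved))
    return tuple(c / norm for c in moved)
```

In a search loop, each stored orientation would be perturbed this way, and the new orientation kept only when it gives a better rmsd; shrinking `scale` over time narrows the neighborhood as the search converges.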

61 r, the normal vector.

62 Perturbation Algorithm

63 MB-Align Algorithm

64 System Implementations. OS: Linux/Red Hat 7.2, run on a Pentium MHz CPU with 1 GB of RAM. Bioperl: PDB file format conversion. Rotation/perturbation/integration: C programs. Minimum bipartite matching: LEDA. Rmsd: PROFIT.

65 Benchmarking

66 Benchmarking

67 Benchmarks

68 Efficiencies of Strategies

69 The End.