Prediction of protein contact maps Piero Fariselli Department of Biology University of Bologna.


Prediction of protein contact maps Piero Fariselli Department of Biology University of Bologna

From Sequence to Function Functional Genomics and Proteomics Genomic sequences >BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus. MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH Protein sequences Protein structures Protein functions

The Protein Folding
TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN

(Rost B.)

The Data Bases of Sequences and Structures >BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus. MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH EMBL: 195,241,608 sequences 292,078,866,691 nucleotides UNIPROT: sequences 154'416'236 residues PDB: structures membrane proteins  1% November/2009

What is a multiple alignment? The short answer is this:
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

Evolutionary information
Multiple Sequence Alignment (MSA) of similar sequences.
Sequence profile: for each position, a 20-valued vector contains the amino acid composition of the aligned sequences.
[Figure: MSA and the resulting sequence profile; axes: sequence position vs. amino acid type (A C D E F G H K I L M N P Q R S T V W Y)]
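The profile described on this slide can be sketched in a few lines. This is a minimal illustration, not the code used by the author: the function name `sequence_profile`, the toy alignment, and the handling of gaps (ignored when counting, all-zero vector for gap-only columns) are assumptions.

```python
from collections import Counter

# Column order of the 20 amino acids as printed on the slide.
AMINO_ACIDS = "ACDEFGHKILMNPQRSTVWY"


def sequence_profile(msa):
    """Return one 20-valued frequency vector per alignment column."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count only standard residues; gaps ('-') are skipped (assumption).
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile


# Toy alignment (hypothetical, not taken from the slide).
toy_msa = ["YKDY", "YRDY", "YRD-"]
toy_profile = sequence_profile(toy_msa)
```

Each row of `toy_profile` sums to 1 for columns that contain at least one residue, so the rows can be fed directly to a predictor as position-specific input.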

3D structure prediction of proteins
[Figure: prediction approaches for new folds and existing folds (ab initio prediction, threading, building by homology); axis: Homology (%)]

Contacts and Contact Maps
Contact definition
[Figure: residues in contact, labelled F297, F156, V299, V271, I240, V238, I269]

Protein contact definitions:
1. Based on Cα
2. Based on Cβ
3. All-atom (without hydrogens)

From the 3D structure to the contact map
Given a protein of length L and a square matrix M of dimension L × L:
For each pair of residues i and j, calculate the distance between i and j.
If distance < threshold, put 1 in the cell M(i,j); otherwise put 0 in the cell M(i,j).
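The procedure above can be sketched as follows. The function name `contact_map`, the default threshold of 8 Å, and the toy coordinates are assumptions for illustration; the input is taken to be one representative atom position per residue.

```python
import math


def contact_map(coords, threshold=8.0):
    """Build an L x L binary contact map.

    coords: list of (x, y, z) tuples in Å, one representative atom per residue.
    threshold: distance cutoff in Å (8.0 is only an example value).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # A cell is 1 when the two residues are closer than the threshold.
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap


# Toy chain: four residues on a straight line, 3.8 Å apart (hypothetical data).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

The map is symmetric, and the diagonal is trivially 1 because each residue is at distance 0 from itself.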

Computation of Contact Maps: from 3D structure to contact map
[Figure: 3D structure with contacting residues F297, F156, V299, V271, I240, V238, I269, and the corresponding contact map with both axes labelled by the sequence TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN]

Protein Structural Classes: All-α, All-β, α+β, α/β

An Example of a Contact map (All-  )

An Example of a Contact map (All-  ) N C

An Example of Contact map (  ) N C

From the contact map to the 3D structure
Two methods have been proposed:
1. Bohr et al., "Protein Structure from Distance Inequalities", J. Mol. Biol. 1993, 231: => based on a steepest descent procedure
2. Vendruscolo and Domany, Fold. Des. 1998, 2: => based on a modified Metropolis procedure

6pti Reconstruction Efficiency (58 residues) At M= 200 No of eliminated true contacts  6 % real contacts No of added false contacts  52 % real contacts RMSD M (Number of random flipping) Vendruscolo and Domany Fold. Des. 1998

From the contact map to the 3D structure: the reconstruction efficiency

3-D Modelling through Contact Maps. Example: Bacteriorhodopsin
[Figure: contact map and model; 1QHJ (1.9 Å); RMSD = 2.5 Å]

MARC efficiency in 3D reconstruction from the protein contact map after progressive elimination of true contacts (6pti)

MARC efficiency in 3D reconstruction after progressive addition of wrong contacts to a protein contact map with 30 % of true contacts (6pti)

Prediction of Contact Maps

Several methods have been applied:
Bohr et al., FEBS :43-46 => based on neural networks
Göbel et al., PROTEINS : => based on correlated mutations in proteins
Thomas et al., Prot. Eng : => based on a statistical method and evolutionary information
Olmea and Valencia, Fold. Des :S25-S32 => based on correlated mutations and other information
Fariselli and Casadio, Prot. Eng :15-21 => based on neural networks and evolutionary information
Fariselli et al., CASP4/ and Prot. Eng. in press => neural networks and other information
Pollastri and Baldi, Bioinformatics S62-S70 => recurrent neural networks

Relevant points:
Contact threshold
Sequence separation (or sequence gap)
No. of contacts vs. no. of non-contacts

The Contact Threshold 16 Å

The Contact Threshold 16 Å 12 Å

The Contact Threshold 16 Å 12 Å 8 Å

The Contact Threshold 16 Å 12 Å 8 Å 6 Å

Sequence separation …VTISCTGSSSNIGAGNHVKWYQQLPG…

The Sequence Separation: example of a sequence separation = 10 residues
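Sequence separation is simply |i - j| for residues at positions i and j. A minimal sketch of how a contact map might be restricted to pairs above a chosen separation is given below; the function name `filter_by_separation` and the inclusive comparison are assumptions, since the slides do not state whether the bound is inclusive.

```python
def filter_by_separation(cmap, min_separation):
    """Zero out all cells whose sequence separation |i - j| is too small.

    cmap: square list-of-lists binary contact map.
    min_separation: pairs with |i - j| >= min_separation are kept
                    (inclusive bound is an assumption of this sketch).
    """
    n = len(cmap)
    return [
        [cmap[i][j] if abs(i - j) >= min_separation else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy 4 x 4 map in which every pair is in contact (hypothetical data).
toy_full = [[1] * 4 for _ in range(4)]
toy_filtered = filter_by_separation(toy_full, 2)
```

Filtering of this kind removes the trivial contacts between residues that are neighbours along the chain, which would otherwise dominate the map.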

Frequency distribution of the real and hypothetical contacts as a function of sequence separation

Relation between the number of contacts and the protein length
[Plot axes: protein length, number of contacts]

Evaluation of the efficiency of contact map predictions
1) Accuracy: $A = N_{cp}^{*} / N_{cp}$, where $N_{cp}^{*}$ and $N_{cp}$ are the number of correctly assigned contacts and the total number of predicted contacts, respectively.
2) Improvement over a random predictor: $R = A / (N_c / N_p)$, where $N_c / N_p$ is the accuracy of a random predictor; $N_c$ is the number of real contacts in the protein of length $L_p$, and $N_p$ is the number of all possible contacts.
3) Difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997): $X_d = \sum_{i=1}^{n} (P_{ic} - P_{ia}) / (n \, d_i)$, where $n$ is the number of bins of the distance distribution (15 equally distributed bins from 4 to 60 Å cluster all the possible distances of residue pairs observed in the protein structure); $d_i$ is the upper limit (normalised to 60 Å) for each bin, e.g. 8 Å for the 4 to 8 Å bin; $P_{ic}$ and $P_{ia}$ are the percentage of predicted contact pairs (with distance between $d_{i-1}$ and $d_i$) and that of all possible pairs, respectively.
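The three measures can be sketched as below. All function names and the toy data are hypothetical. For Xd, the sketch assumes 15 bins of 4 Å whose upper limits run from 4 to 60 Å, which is one reading of the bin description that agrees with the "4 to 8 Å" example; distances of 60 Å or more are ignored.

```python
def accuracy(predicted, real):
    """A = Ncp* / Ncp, with predicted and real given as sets of (i, j) pairs."""
    if not predicted:
        return 0.0
    return len(predicted & real) / len(predicted)


def improvement_over_random(predicted, real, n_possible):
    """R = A / (Nc / Np), where Nc = len(real) and Np = n_possible."""
    return accuracy(predicted, real) / (len(real) / n_possible)


def xd(predicted_distances, all_distances, bin_width=4.0, max_distance=60.0):
    """Xd = sum_i (P_ic - P_ia) / (n * d_i) over n distance bins.

    Assumed binning: n = max_distance / bin_width bins starting at 0 Å.
    """
    n_bins = int(round(max_distance / bin_width))

    def percentages(distances):
        counts = [0] * n_bins
        for d in distances:
            b = int(d // bin_width)
            if 0 <= b < n_bins:  # distances outside the range are dropped
                counts[b] += 1
        total = sum(counts)
        return [100.0 * c / total if total else 0.0 for c in counts]

    p_c = percentages(predicted_distances)
    p_a = percentages(all_distances)
    result = 0.0
    for i in range(n_bins):
        d_i = (i + 1) * bin_width / max_distance  # upper limit, normalised
        result += (p_c[i] - p_a[i]) / (n_bins * d_i)
    return result


# Toy example (hypothetical pairs and distances).
toy_predicted = {(0, 8), (1, 9), (2, 12)}
toy_real = {(0, 8), (2, 12), (3, 15)}
toy_A = accuracy(toy_predicted, toy_real)
toy_R = improvement_over_random(toy_predicted, toy_real, n_possible=30)
```

Because each bin is weighted by 1/d_i, Xd is positive when the predicted pairs are shifted towards short distances relative to all pairs, and 0 when the two distributions coincide.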

Tools out of machine learning approaches: Neural Networks
[Diagram: Training: data base subset with known mapping => general rules. Prediction: new sequence (TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN) => prediction]

Contact definition used: C  - C  distance < 0.8 nm Sequence gap > 7 residues

The database of proteins used to train and test the contact map predictors.

Neural Network-based predictor
1 output neuron (contact/non-contact)
1 hidden layer with 8 neurons
Input layer with 1071 input neurons:
Ordered residue pairs (1050 neurons)
Secondary structures (18 neurons)
Correlated mutations (1 neuron)
Sequence conservation (2 neurons)
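The layer sizes listed above (1050 + 18 + 1 + 2 = 1071 inputs, 8 hidden neurons, 1 output) can be illustrated with a forward pass. This is only a sketch with random, untrained weights: the sigmoid activations, the weight initialisation, and the names `init_network` and `predict` are assumptions, not details given in the slides.

```python
import math
import random

N_INPUT = 1050 + 18 + 1 + 2  # residue pairs + secondary structure + correlated mutations + conservation
N_HIDDEN = 8


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(seed=0):
    """Create small random weights (untrained, for illustration only)."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.01, 0.01) for _ in range(N_INPUT)] for _ in range(N_HIDDEN)]
    b_hidden = [0.0] * N_HIDDEN
    w_out = [rng.uniform(-0.01, 0.01) for _ in range(N_HIDDEN)]
    b_out = 0.0
    return w_hidden, b_hidden, w_out, b_out


def predict(network, x):
    """Return a contact score in (0, 1) for one 1071-valued input vector."""
    if len(x) != N_INPUT:
        raise ValueError(f"expected {N_INPUT} inputs, got {len(x)}")
    w_hidden, b_hidden, w_out, b_out = network
    hidden = [
        sigmoid(sum(w * v for w, v in zip(weights, x)) + b)
        for weights, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


toy_network = init_network()
toy_score = predict(toy_network, [0.0] * N_INPUT)
```

In practice one input vector would be built for every residue pair (i, j) that satisfies the sequence-gap condition, and the output would be thresholded to decide contact or non-contact.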

Representation of the input coding based on ordered couples.
(A) An alignment of 5 (hypothetical) sequences as represented in an HSSP file (Sander and Schneider, 1991). i and j stand for the positions of the two residues making or not making contact (A and D in the leading sequence, or sequence 1).
(B) Single sequence coding. The position representing the couple (AD) in the vector is set to 1.0, while the other positions are set to 0.
(C) Multiple sequence coding. For each sequence in the alignment (1 to 5 in the scheme in A), the couple of residues in positions i and j is counted. The final input coding, representing the frequency of each couple in the alignment, is normalized to the number of sequences.
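The two codings in (B) and (C) can be sketched as follows. The sketch assumes that an "ordered couple" means a residue pair sorted alphabetically, so that (A, D) and (D, A) share one slot and there are 210 slots; the slides do not spell this out, and they do not say how the 1050 pair inputs are laid out. The function names and the toy alignment are hypothetical, and couples containing a gap are skipped.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical order
# All couples (a, b) with a <= b: 20 * 21 / 2 = 210 slots (assumed layout).
COUPLES = list(combinations_with_replacement(AMINO_ACIDS, 2))
COUPLE_INDEX = {couple: k for k, couple in enumerate(COUPLES)}


def couple_index(a, b):
    """Slot of the couple (a, b), independent of the order of a and b."""
    return COUPLE_INDEX[tuple(sorted((a, b)))]


def encode_single(seq, i, j):
    """(B) Single sequence coding: 1.0 at the couple's slot, 0 elsewhere."""
    vec = [0.0] * len(COUPLES)
    vec[couple_index(seq[i], seq[j])] = 1.0
    return vec


def encode_alignment(msa, i, j):
    """(C) Multiple sequence coding: couple frequencies over the alignment."""
    vec = [0.0] * len(COUPLES)
    for seq in msa:
        a, b = seq[i], seq[j]
        if a in AMINO_ACIDS and b in AMINO_ACIDS:  # skip gaps (assumption)
            vec[couple_index(a, b)] += 1.0
    return [v / len(msa) for v in vec]


# Toy alignment of 5 two-residue sequences (hypothetical data).
toy_alignment = ["AD", "AD", "AE", "DA", "GD"]
toy_vector = encode_alignment(toy_alignment, 0, 1)
```

With this toy alignment the (A, D) slot receives 3/5, because the couples AD, AD and DA all map to the same slot.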

Correlated mutations
Multiple sequence alignment (N sequences):
1 MVKGPGLYTDIGKKARDLLYKDYHSDKKFTISTYSPTGVAITSS
2 MVKGPGLYSDIGKRARDLLYRDYQSDHKFTLTTYTANGVAITST
3 MVKGPGLYTEIGKKARDLLYRDYQGDQKFSVTTYSSTGVAITTT
For each alignment position, every couple of sequences (1-2, 1-3, 2-3; M = N·(N-1)/2 couples) is scored with the McLachlan substitution matrix S. This gives M-valued vectors Vi and Vj for positions i and j, e.g. S(T;S), S(T;T), S(S;T) for position i and S(I;L), S(I;V), S(L;V) for position j.
Correlation: computed between the vectors Vi and Vj.
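A minimal sketch of the correlated-mutation calculation for the three sequences on this slide follows. Two things are assumptions: `toy_similarity` is a hypothetical identity score standing in for the McLachlan matrix, whose values are not given here, and the correlation is taken to be a plain Pearson coefficient, which the slide does not specify.

```python
import math
from itertools import combinations

msa = [
    "MVKGPGLYTDIGKKARDLLYKDYHSDKKFTISTYSPTGVAITSS",
    "MVKGPGLYSDIGKRARDLLYRDYQSDHKFTLTTYTANGVAITST",
    "MVKGPGLYTEIGKKARDLLYRDYQGDQKFSVTTYSSTGVAITTT",
]


def toy_similarity(a, b):
    """Hypothetical stand-in for S(a;b): 1 for identical residues, else 0."""
    return 1.0 if a == b else 0.0


def position_vector(alignment, pos, score=toy_similarity):
    """M-valued vector for one column, M = N*(N-1)/2 sequence couples."""
    return [score(s[pos], t[pos]) for s, t in combinations(alignment, 2)]


def pearson(u, v):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0.0 or sd_v == 0.0:
        return 0.0
    return cov / (sd_u * sd_v)


def correlated_mutation(alignment, i, j, score=toy_similarity):
    """Correlation between the couple-score vectors of columns i and j."""
    return pearson(position_vector(alignment, i, score),
                   position_vector(alignment, j, score))


# Columns 8 and 9 (0-based) hold T/S/T and D/D/E in the three sequences.
toy_correlation = correlated_mutation(msa, 8, 9)
```

Fully conserved columns give a constant vector, so the sketch returns 0.0 for them rather than dividing by zero; a real substitution matrix would replace `toy_similarity` through the `score` argument.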

The neural network architecture for prediction of contact maps

Accuracy of contact map prediction using a cross-validated data set (170 proteins)
[Plot axes: accuracy, no. of proteins]

T0087: 310 residues (A = 0.20 FR/NF ) N C

N C T0106: 123 residues (A=0.06 FR / NF )

N C T0128: 222 residues (A = 0.24 CM )

T0110: 128 residues (A = 0.30 FR ) N C

N C T0125: 141 residues (A = 0.03 CM )

C N T0124: 242 residues (A = 0.01 NF)

TARGET: T0115 (300 residues) (A = 0.17 FR/NF) PDB code: 1FWK (Homoserine kinase, Methanococcus jannaschii) C N Sequence position

Predictive performance on 29 targets. Q3 = secondary structure prediction accuracy; Fr(H) and Fr(E) = frequency of predicted and observed alpha and beta structures in the chain; Lp = protein length in residues; Nal = number of sequences in the alignment; Xd and A are as defined in equations 2 and 1, respectively; Class is the classification of targets by prediction difficulty: CM = comparative modeling, FR = fold recognition, NF = new fold.

COMMENTS
The predictor is trained mainly on globular mixed proteins.
Contacts among beta structures dominate.
Contacts in all-alpha proteins are more difficult to predict.
A filtering algorithm is needed.