Prediction of protein contact maps Piero Fariselli Department of Biology University of Bologna
From Sequence to Function: Functional Genomics and Proteomics
Genomic sequences => Protein sequences => Protein structures => Protein functions
Example protein sequence:
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
The Protein Folding: TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN
(Rost B.)
The Databases of Sequences and Structures (November 2009)
Example entry: BGAL_SULSO, beta-galactosidase, Sulfolobus solfataricus
EMBL: 195,241,608 sequences; 292,078,866,691 nucleotides
UNIPROT: sequences, 154,416,236 residues
PDB: structures; membrane proteins 1%
What is a multiple alignment? The short answer is this:
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
Evolutionary information
Multiple Sequence Alignment (MSA) of similar sequences
Sequence profile: for each sequence position, a 20-valued vector contains the amino acid composition of the aligned sequences.
(Figure: an MSA and the sequence profile derived from it, with sequence position on one axis and the 20 residue types A C D E F G H K I L M N P Q R S T V W Y on the other.)
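The sequence profile described above can be sketched in a few lines. This is a minimal illustration, not the code behind the slide: the function name `sequence_profile`, the toy alignment, the alphabetical residue order, and the choice to normalise each column by its non-gap residues are all assumptions.

```python
from collections import Counter

# The 20 residue types in alphabetical one-letter order (an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_profile(alignment):
    """Return one 20-valued frequency vector per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Gaps and unknown symbols are ignored (assumption).
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values())
        # A column made only of gaps gets an all-zero vector.
        profile.append([counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS])
    return profile

# Hypothetical toy alignment, not the data shown on the slide.
toy_alignment = ["YKDY", "YRDY", "YR-Y", "GRDF"]
profile = sequence_profile(toy_alignment)
```

Each row of `profile` corresponds to one sequence position, so a protein of length L gives an L x 20 table that can be fed to a predictor in place of the single sequence.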
3D structure prediction of proteins: building by homology, threading, ab initio prediction (figure labels: Homology (%), existing folds, new folds)
Contacts and Contact Maps: contact definition (figure: residues F297, F156, V299, V271, I240, V238 and I269)
Protein contact definitions:
1. Based on Cα
2. Based on Cβ
3. All-atom (without hydrogens)
From the 3D structure to the contact map
Given a protein of length L and a square matrix M of dimension L × L:
for each pair of residues i and j, calculate the distance between i and j;
if distance < threshold, put 1 in the cell M(i,j);
otherwise, put 0 in the cell M(i,j).
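The procedure above translates almost line by line into code. The sketch below assumes that each residue is represented by a single point (for example one atom per residue) with coordinates in Å; the function name `contact_map`, the default threshold of 8 Å and the toy coordinates are illustrative choices, not part of the slide.

```python
import math

def contact_map(coords, threshold=8.0):
    """Build the L x L contact map M from one (x, y, z) point per residue."""
    length = len(coords)
    m = [[0] * length for _ in range(length)]
    for i in range(length):
        for j in range(length):
            # 1 if the two residues are closer than the threshold, else 0.
            if math.dist(coords[i], coords[j]) < threshold:
                m[i][j] = 1
    return m

# Hypothetical toy chain: four residues on a straight line, 3.8 Å apart.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

Followed literally, the rule also marks the diagonal (distance 0) as contacts, and the resulting matrix is symmetric, so only one triangle carries information.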
Computation of Contact Maps: from 3D structure to contact map (figure: the contact map has the sequence TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN along its axes)
Protein Structural Classes: All-α, All-β, α+β, α/β
An Example of a Contact map (All- )
An Example of a Contact map (All- ) N C
An Example of Contact map ( ) N C
From the contact map to the 3D structure
Two methods have been proposed:
1. Bohr et al., "Protein Structure from Distance Inequalities", J. Mol. Biol. 1993, 231: => based on a steepest descent procedure
2. Vendruscolo and Domany, Fold. Des. 1998, 2: => based on a modified Metropolis procedure
6pti Reconstruction Efficiency (58 residues)
At M = 200: No. of eliminated true contacts: 6 % real contacts; No. of added false contacts: 52 % real contacts
(Figure: RMSD vs. M, the number of random flips; Vendruscolo and Domany, Fold. Des. 1998)
From the contact map to the 3D structure: the reconstruction efficiency
3-D Modelling through Contact Maps, example: Bacteriorhodopsin (figure labels: contact map with N and C termini; model; 1QHJ (1.9 Å); RMSD = 2.5 Å)
MARC efficiency in 3D reconstruction from the protein contact map after progressive elimination of true contacts (6pti)
MARC efficiency in 3D reconstruction after progressive addition of wrong contacts to a protein contact map with 30 % of true contacts (6pti)
Prediction of Contact Maps
Several methods have been applied:
Bohr et al., FEBS :43-46 => based on neural networks
Göbel et al., PROTEINS : => based on correlated mutations in proteins
Thomas et al., Prot. Eng : => based on a statistical method and evolutionary information
Olmea and Valencia, Fold. Des :S25-S32 => based on correlated mutations and other information
Fariselli and Casadio, Prot. Eng :15-21 => based on neural networks and evolutionary information
Fariselli et al., CASP4/ and Prot. Eng. in press => neural networks and other information
Pollastri and Baldi, Bioinformatics S62-S70 => recurrent neural networks
Relevant points:
Contact threshold
Sequence separation (or sequence gap)
No. of contacts vs. No. of non-contacts
The Contact Threshold (figure shown at thresholds of 16 Å, 12 Å, 8 Å and 6 Å)
Sequence separation …VTISCTGSSSNIGAGNHVKWYQQLPG…
The Sequence Separation: example of a sequence separation = 10 residues
Frequency distribution of the real and hypothetical contacts as a function of sequence separation
Relation between the number of contacts and the protein length (figure axes: protein length vs. number of contacts)
Evaluation of the efficiency of contact map predictions
1) Accuracy: $A = N_{cp}^{*} / N_{cp}$, where $N_{cp}^{*}$ and $N_{cp}$ are the number of correctly assigned contacts and the total number of predicted contacts, respectively.
2) Improvement over a random predictor: $R = A / (N_c / N_p)$, where $N_c / N_p$ is the accuracy of a random predictor; $N_c$ is the number of real contacts in the protein of length $L_p$, and $N_p$ is the number of all possible contacts.
3) Difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997): $X_d = \sum_{i=1}^{n} (P_{ic} - P_{ia}) / (n \, d_i)$, where $n$ is the number of bins of the distance distribution (15 equally distributed bins from 4 to 60 Å cluster all the possible distances of residue pairs observed in the protein structure); $d_i$ is the upper limit (normalised to 60 Å) of each bin, e.g. 8 Å for the 4 to 8 Å bin; $P_{ic}$ and $P_{ia}$ are the percentage of predicted contact pairs (with distance between $d_{i-1}$ and $d_i$) and that of all possible pairs, respectively.
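The three indices can be computed as in the following sketch, which follows the definitions of A, R and Xd given above. The function names, the representation of contacts as sets of (i, j) pairs, and the decision to clamp distances outside the 4 to 60 Å range into the first or last bin are assumptions made for illustration.

```python
def accuracy(predicted, real):
    """A = Ncp* / Ncp: fraction of predicted contacts that are real contacts."""
    predicted, real = set(predicted), set(real)
    if not predicted:
        return 0.0
    return len(predicted & real) / len(predicted)

def improvement_over_random(predicted, real, n_possible):
    """R = A / (Nc / Np), with Np the number of all possible contacts."""
    random_accuracy = len(set(real)) / n_possible
    return accuracy(predicted, real) / random_accuracy

def bin_percentages(distances, n_bins, d_min, d_max):
    """Percentage of the given distances falling in each equally wide bin."""
    if not distances:
        raise ValueError("at least one distance is required")
    width = (d_max - d_min) / n_bins
    counts = [0] * n_bins
    for d in distances:
        index = int((d - d_min) // width)
        # Out-of-range distances go to the nearest bin (assumption).
        counts[min(max(index, 0), n_bins - 1)] += 1
    return [100.0 * c / len(distances) for c in counts]

def xd(predicted_distances, all_distances, n_bins=15, d_min=4.0, d_max=60.0):
    """Xd = sum over bins of (P_ic - P_ia) / (n * d_i)."""
    width = (d_max - d_min) / n_bins
    p_c = bin_percentages(predicted_distances, n_bins, d_min, d_max)
    p_a = bin_percentages(all_distances, n_bins, d_min, d_max)
    total = 0.0
    for i in range(n_bins):
        d_i = (d_min + (i + 1) * width) / d_max  # upper bin limit, normalised to d_max
        total += (p_c[i] - p_a[i]) / (n_bins * d_i)
    return total

# Hypothetical toy prediction: 4 predicted pairs, 2 of them real contacts.
toy_predicted = {(1, 10), (2, 20), (3, 30), (4, 40)}
toy_real = {(1, 10), (2, 20), (5, 50)}
toy_a = accuracy(toy_predicted, toy_real)
toy_r = improvement_over_random(toy_predicted, toy_real, n_possible=30)
```

Xd is zero when predicted pairs have the same distance distribution as all pairs, and positive when predicted pairs are shifted towards shorter distances, since short-distance bins are weighted by the larger factor 1/d_i.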
Tools out of machine learning approaches: Neural Networks
(Diagram. Training: database subset with known mapping => general rules. Prediction: new sequence, e.g. TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN => prediction.)
Contact definition used: C - C distance < 0.8 nm Sequence gap > 7 residues
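As a hedged sketch of how this definition selects residue pairs: given a precomputed matrix of inter-residue distances in nm (between whichever atoms the definition refers to), keep the pairs that are closer than 0.8 nm and more than 7 residues apart in sequence. The function name `contacts_with_gap` and the toy distance matrix are hypothetical.

```python
def contacts_with_gap(dist, threshold_nm=0.8, min_gap=7):
    """Return pairs (i, j), i < j, with distance < threshold and j - i > min_gap."""
    length = len(dist)
    return [
        (i, j)
        for i in range(length)
        for j in range(i + min_gap + 1, length)  # enforces j - i > min_gap
        if dist[i][j] < threshold_nm
    ]

# Hypothetical toy matrix: 10 residues, every pair 0.5 nm apart.
toy_dist = [[0.0 if i == j else 0.5 for j in range(10)] for i in range(10)]
toy_contacts = contacts_with_gap(toy_dist)
```

With a sequence gap of 7, a chain of 10 residues has only three candidate pairs, which illustrates how strongly the gap reduces the number of pairs in short proteins.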
The database of proteins used to train and test the contact map predictors.
Neural Network-based predictor
1 output neuron (contact/non-contact)
1 hidden layer with 8 neurons
Input layer with 1071 input neurons:
Ordered residue pairs (1050 neurons)
Secondary structures (18 neurons)
Correlated mutations (1 neuron)
Sequence conservation (2 neurons)
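The layer sizes listed above (1050 + 18 + 1 + 2 = 1071 inputs, 8 hidden neurons, 1 output) can be illustrated with a plain forward pass. This is only a sketch of the network shape: the sigmoid activation, the bias terms, the random initial weights and all names are assumptions, and no training is shown.

```python
import math
import random

N_INPUT = 1050 + 18 + 1 + 2   # residue pairs, secondary structure, correlated mutations, conservation
N_HIDDEN = 8

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_weights(seed=0):
    """Small random weights; the last entry of each weight vector is the bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.01, 0.01) for _ in range(N_INPUT + 1)]
                for _ in range(N_HIDDEN)]
    w_out = [rng.uniform(-0.01, 0.01) for _ in range(N_HIDDEN + 1)]
    return w_hidden, w_out

def forward(x, w_hidden, w_out):
    """Return the output activation (contact score in (0, 1)) for one input vector."""
    hidden = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return sigmoid(w_out[-1] + sum(wi * hi for wi, hi in zip(w_out, hidden)))

w_hidden, w_out = init_weights()
score = forward([0.0] * N_INPUT, w_hidden, w_out)
```

A residue pair would be labelled as a contact when the score exceeds a chosen cut-off; the cut-off itself is not given on the slide.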
Representation of the input coding based on ordered couples.
(A) An alignment of 5 (hypothetical) sequences as represented in an HSSP file (Sander and Schneider, 1991). i and j stand for the positions of the two residues making or not making contact (A and D in the leading sequence, sequence 1).
(B) Single sequence coding. The position representing the couple (AD) in the vector is set to 1.0, while the other positions are set to 0.
(C) Multiple sequence coding. For each sequence in the alignment (1 to 5 in the scheme in A), the couple of residues in positions i and j is counted. The final input coding, representing the frequency of each couple in the alignment, is normalized to the number of sequences.
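A minimal sketch of the couple coding described in this caption follows. It stores the coding as a sparse dictionary keyed by the couple (residue at i, residue at j) rather than as a fixed-length vector, because the exact vector layout is not given here; the function name `couple_coding` and the toy alignment are hypothetical.

```python
from collections import Counter

def couple_coding(alignment, i, j):
    """Frequency of each residue couple at columns i and j, normalised by the number of sequences."""
    counts = Counter((row[i], row[j]) for row in alignment)
    n_sequences = len(alignment)
    return {couple: count / n_sequences for couple, count in counts.items()}

# Hypothetical alignment of 5 sequences; columns 0 and 2 play the roles of i and j.
toy_alignment = ["AKD", "AKD", "ARD", "GKD", "AKE"]

single = couple_coding(toy_alignment[:1], 0, 2)   # single sequence coding
multiple = couple_coding(toy_alignment, 0, 2)     # multiple sequence coding
```

With one sequence the couple (A, D) gets the value 1.0 and every other couple is absent (0), which matches the single sequence coding; with the full alignment the values become frequencies that sum to 1.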
Correlated mutations
Multiple sequence alignment (N sequences):
1 MVKGPGLYTDIGKKARDLLYKDYHSDKKFTISTYSPTGVAITSS
2 MVKGPGLYSDIGKRARDLLYRDYQSDHKFTLTTYTANGVAITST
3 MVKGPGLYTEIGKKARDLLYRDYQGDQKFSVTTYSSTGVAITTT
The sequences are compared in M = N·(N-1)/2 couples (here 1-2, 1-3 and 2-3). For each of the two positions i and j, every couple of sequences gives a substitution score S (S: McLachlan substitution matrix):
position i: S(T;S), S(T;T), S(S;T)
position j: S(I;L), S(I;V), S(L;V)
The scores form the M-valued vectors Vi and Vj; the correlated mutation of i and j is the correlation between Vi and Vj.
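The construction of the M-valued vectors and their correlation can be sketched as follows. Two loud assumptions: the similarity function below is a hypothetical toy grouping and NOT the McLachlan substitution matrix, and the correlation is a plain unweighted Pearson coefficient, which may differ from the exact formula on the slide. Conserved columns, whose vector has zero variance, are given a correlation of 0.0 by convention here.

```python
import math
from itertools import combinations

# Hypothetical residue groups standing in for a real substitution matrix.
TOY_GROUPS = ["ILVM", "ST", "KR", "DE", "FYW"]

def toy_similarity(a, b):
    if a == b:
        return 1.0
    if any(a in group and b in group for group in TOY_GROUPS):
        return 0.5
    return 0.0

def column_vector(alignment, pos, similarity=toy_similarity):
    """One score per couple of sequences: M = N*(N-1)/2 values."""
    return [similarity(s[pos], t[pos]) for s, t in combinations(alignment, 2)]

def correlated_mutation(alignment, i, j, similarity=toy_similarity):
    """Pearson correlation between the score vectors of columns i and j."""
    v_i = column_vector(alignment, i, similarity)
    v_j = column_vector(alignment, j, similarity)
    mean_i, mean_j = sum(v_i) / len(v_i), sum(v_j) / len(v_j)
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum((a - mean_i) ** 2 for a in v_i) *
                     sum((b - mean_j) ** 2 for b in v_j))
    return cov / norm if norm else 0.0

alignment = [
    "MVKGPGLYTDIGKKARDLLYKDYHSDKKFTISTYSPTGVAITSS",
    "MVKGPGLYSDIGKRARDLLYRDYQSDHKFTLTTYTANGVAITST",
    "MVKGPGLYTEIGKKARDLLYRDYQGDQKFSVTTYSSTGVAITTT",
]
v_i = column_vector(alignment, 8)             # residues T, S, T (0-based column 8)
r_same = correlated_mutation(alignment, 8, 13)  # column 13 holds K, R, K
```

Columns 8 and 13 change in the same sequence (sequence 2), so their vectors coincide and the toy correlation is 1; with a real substitution matrix and more sequences the values are graded.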
The neural network architecture for prediction of contact maps
Accuracy of contact map prediction using a cross-validated data set (170 proteins) (figure axes: accuracy vs. number of proteins)
T0087: 310 residues (A = 0.20, FR/NF)
T0106: 123 residues (A = 0.06, FR/NF)
T0128: 222 residues (A = 0.24, CM)
T0110: 128 residues (A = 0.30, FR)
T0125: 141 residues (A = 0.03, CM)
T0124: 242 residues (A = 0.01, NF)
TARGET: T0115 (300 residues) (A = 0.17, FR/NF); PDB code: 1FWK (homoserine kinase, Methanococcus jannaschii)
(Each target slide shows a contact map; the axes give the sequence position, with N and C marking the chain termini.)
Predictive performance on 29 targets. Q3 = secondary structure prediction accuracy; Fr(H) and Fr(E) = frequency of predicted and observed alpha and beta structures in the chain; Lp = protein length in residues; Nal = number of sequences in the alignment; Xd and A are as defined in equations 2 and 1, respectively; Class is the classification of targets by prediction difficulty: CM = comparative modeling, FR = fold recognition, NF = new fold.
COMMENTS
The predictor is trained mainly on globular mixed proteins
Contacts among beta structures dominate
Contacts in all-alpha proteins are more difficult to predict
A filtering algorithm is needed