Prediction of protein contact maps
Piero Fariselli, Department of Biology, University of Bologna

From Sequence to Function: Functional Genomics and Proteomics
Genomic sequences => Protein sequences => Protein structures => Protein functions

>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

The Protein Folding
TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN

(Rost B.) http://dodo.cpmc.columbia.edu/cubic/papers/

The Data Bases of Sequences and Structures (November 2009)
Example entry: >BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus (the sequence shown on the "From Sequence to Function" slide)
EMBL: 195,241,608 sequences; 292,078,866,691 nucleotides
UNIPROT: 428,650 sequences; 154,416,236 residues
PDB: 68,000 structures; membrane proteins about 1%

What is a multiple alignment? The short answer is this:

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

Evolutionary information

Multiple Sequence Alignment (MSA) of similar sequences:

 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K

Sequence profile: for each position, a 20-valued vector contains the amino acid composition (in %) of the aligned sequences. Columns are sequence positions.

     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
A    0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D    0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E    0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F    0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G   10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H    0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K    0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I    0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L    0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M    0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N    0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q    0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R    0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S    0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T   20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V    0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W    0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y   70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
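As an illustration of how a sequence profile can be derived from an MSA, here is a minimal Python sketch. The function name `sequence_profile` and the toy alignment are hypothetical, and the gap handling (percentages taken over the non-gap residues of each column) is an assumption, since the slide does not say how gaps are normalized.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_profile(alignment):
    """Return one 20-valued vector per alignment column.

    Each vector holds the percentage of every amino acid in that column.
    Assumption: gaps ('-') are ignored and percentages are computed over
    the non-gap residues only.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            [100.0 * counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile

# Toy alignment (hypothetical, four sequences of four residues)
msa = ["YKDY", "YRDY", "GYGF", "TKGY"]
profile = sequence_profile(msa)
```

Here `profile[0]` is the vector for the first column, with entries ordered as in `AMINO_ACIDS`.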

3D structure prediction of proteins
[Figure: new folds vs. existing folds; methods: ab initio prediction, threading, building by homology; axis: homology (%), 0 to 100]

Contacts and Contact Maps: contact definition
[Figure: residues F297, F156, V299, V271, I240, V238, I269]

Protein contact definitions:
1. Based on Cα
2. Based on Cβ
3. All-atom (without hydrogens)

From the 3D structure to the contact map
Given a protein of length L and a square matrix M of dimension L × L:
For each pair of residues i and j:
  calculate the distance between i and j
  if distance < threshold, put 1 in the cell M(i,j)
  otherwise put 0 in the cell M(i,j)
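The procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `contact_map`, the default threshold of 8 Å and the toy coordinates are assumptions, and one representative atom per residue is assumed to be given as an (x, y, z) tuple.

```python
import math

def contact_map(coords, threshold=8.0):
    """Build an L x L binary contact map from one 3D point per residue.

    coords: list of (x, y, z) tuples, one per residue (for example the
    coordinates of a chosen representative atom, in angstroms).
    M[i][j] is 1 when the distance between residues i and j is below
    the threshold and 0 otherwise. The diagonal is 1 because the
    distance of a residue from itself is 0.
    """
    n = len(coords)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if math.dist(coords[i], coords[j]) < threshold:
                m[i][j] = m[j][i] = 1  # the map is symmetric
    return m

# Toy example: four points spaced 3.8 angstroms apart on a line
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_map = contact_map(toy_coords, threshold=8.0)
```

In the toy example, residues 0 and 2 (7.6 Å apart) are in contact while residues 0 and 3 (11.4 Å apart) are not.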

Computation of Contact Maps: from 3D structure to contact map
[Figure: residues F297, F156, V299, V271, I240, V238, I269 in the 3D structure; contact map axes labelled with the sequence TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN]

Protein Structural Classes: All-α, All-β, α+β, α/β

An Example of a Contact map (All-  ) 1 2 3 4 2 1 3 4

An Example of a Contact map (All-  ) N C

An Example of Contact map (  ) N C

From the contact map to the 3D structure
Two methods have been proposed:
1. Bohr et al., "Protein Structure from Distance Inequalities", J. Mol. Biol. 1993, 231:861-869 => based on a steepest descent procedure
2. Vendruscolo and Domany, Fold. Des. 1998, 2:295-306 => based on a modified Metropolis procedure

6pti Reconstruction Efficiency (58 residues)
At M = 200:
No. of eliminated true contacts: 6% of real contacts
No. of added false contacts: 52% of real contacts
[Figure: RMSD vs. M (number of random flippings); Vendruscolo and Domany, Fold. Des. 1998]

From the contact map to the 3D structure: the reconstruction efficiency

3-D Modelling through Contact Maps, example: Bacteriorhodopsin
[Figure: contact map with N and C termini marked; 1QHJ (1.9 Å); model; RMSD = 2.5 Å]

MARC efficiency in 3D reconstruction from the protein contact map after progressive elimination of true contacts (6pti)

MARC efficiency in 3D reconstruction after progressive addition of wrong contacts to a protein contact map with 30 % of true contacts (6pti)

Prediction of Contact Maps

Several methods have been applied:
Bohr et al., FEBS 1990 261:43-46 => based on neural networks
Göbel et al., PROTEINS 1994 18:309-317 => based on correlated mutations in proteins
Thomas et al., Prot. Eng. 1996 9:941-948 => based on a statistical method and evolutionary information
Olmea and Valencia, Fold. Des. 1997 2:S25-S32 => based on correlated mutations and other information
Fariselli and Casadio, Prot. Eng. 1999 12:15-21 => based on neural networks and evolutionary information
Fariselli et al., CASP4/ and Prot. Eng. in press => neural networks and other information
Pollastri and Baldi, Bioinformatics 2002 18:S62-S70 => recurrent neural networks

Relevant points:
- Contact threshold
- Sequence separation (or sequence gap)
- No. of contacts vs. No. of non-contacts

The Contact Threshold 16 Å 12 Å 8 Å 6 Å

Sequence separation
[Figure: sequence fragment …VTISCTGSSSNIGAGNHVKWYQQLPG… with positions 1, 20, 40 and 100 marked]

The Sequence Separation: example of a sequence separation = 10 residues
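To make the notion of sequence separation concrete, the following Python sketch keeps only the contacts whose residues are at least a given number of positions apart in the sequence. The function name `contacts_at_separation` and the toy map are hypothetical; the map is assumed to be a symmetric list of lists with 0/1 entries.

```python
def contacts_at_separation(cmap, min_separation):
    """List contacts (i, j), i < j, with j - i >= min_separation.

    cmap: symmetric L x L matrix of 0/1 values (list of lists).
    min_separation: minimum distance along the sequence, in residues.
    """
    n = len(cmap)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + min_separation, n)
        if cmap[i][j] == 1
    ]

# Toy 5-residue map (hypothetical) with contacts (0,1), (0,4) and (1,3)
toy = [[0] * 5 for _ in range(5)]
for a, b in [(0, 1), (0, 4), (1, 3)]:
    toy[a][b] = toy[b][a] = 1

far_contacts = contacts_at_separation(toy, 2)
```

With a minimum separation of 2, the contact between the neighbouring residues 0 and 1 is discarded and only the two longer-range contacts remain.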

Frequency distribution of the real and hypothetical contacts as a function of sequence separation

Relation between the number of contacts and the protein length
[Figure: number of contacts vs. protein length]

Evaluation of the efficiency of contact map predictions

1) Accuracy: $A = N_{cp}^{*} / N_{cp}$, where $N_{cp}^{*}$ and $N_{cp}$ are the number of correctly assigned contacts and the total number of predicted contacts, respectively.

2) Improvement over a random predictor: $R = A / (N_c / N_p)$, where $N_c / N_p$ is the accuracy of a random predictor; $N_c$ is the number of real contacts in the protein of length $L_p$, and $N_p$ is the number of all possible contacts.

3) Difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997): $X_d = \sum_{i=1}^{n} \frac{P_{ic} - P_{ia}}{n \, d_i}$, where $n$ is the number of bins of the distance distribution (15 equally distributed bins from 4 to 60 Å cluster all the possible distances of residue pairs observed in the protein structure); $d_i$ is the upper limit (normalised to 60 Å) for each bin, e.g. 8 Å for the 4 to 8 Å bin; $P_{ic}$ and $P_{ia}$ are the percentage of predicted contact pairs (with distance between $d_{i-1}$ and $d_i$) and that of all possible pairs, respectively.
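The three scores can be sketched in Python as follows. This is an illustration under stated assumptions, not the evaluation code used for the results: the function names are hypothetical, contacts are given as (i, j) tuples with i < j, the bins are taken as 15 intervals of 4 Å with upper limits 4, 8, ..., 60 Å, d_i is the upper limit divided by 60, and distances above 60 Å are counted in the last bin.

```python
def accuracy(predicted, real):
    """A = Ncp* / Ncp: fraction of predicted contacts that are real."""
    predicted, real = set(predicted), set(real)
    if not predicted:
        return 0.0
    return len(predicted & real) / len(predicted)

def improvement_over_random(predicted, real, n_possible):
    """R = A / (Nc / Np), where Nc / Np is the accuracy of a random predictor."""
    return accuracy(predicted, real) / (len(set(real)) / n_possible)

def distance_score(predicted_distances, all_distances, n_bins=15, max_dist=60.0):
    """Xd = sum over bins of (P_ic - P_ia) / (n * d_i).

    Assumptions: equal-width bins up to max_dist, percentages in the
    range 0-100, d_i = bin upper limit / max_dist, and distances beyond
    max_dist clamped into the last bin.
    """
    if not predicted_distances or not all_distances:
        raise ValueError("both distance lists must be non-empty")
    width = max_dist / n_bins

    def percentages(distances):
        counts = [0] * n_bins
        for d in distances:
            counts[min(int(d // width), n_bins - 1)] += 1
        return [100.0 * c / len(distances) for c in counts]

    p_c = percentages(predicted_distances)
    p_a = percentages(all_distances)
    total = 0.0
    for i in range(n_bins):
        d_i = (i + 1) * width / max_dist  # normalised upper limit of bin i
        total += (p_c[i] - p_a[i]) / (n_bins * d_i)
    return total

# Toy example (hypothetical contacts)
real = {(0, 8), (1, 10), (2, 12), (3, 20)}
predicted = {(0, 8), (1, 10), (4, 15), (5, 30)}
a = accuracy(predicted, real)
r = improvement_over_random(predicted, real, n_possible=100)
```

In the toy example two of the four predicted contacts are real, so A = 0.5; with 4 real contacts among 100 possible pairs, a random predictor would score 0.04. A positive Xd indicates that predicted pairs are shifted towards shorter distances than the overall pair distribution.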

Tools out of machine learning approaches: Neural Networks
[Diagram: Data Base Subset, Known mapping (e.g. TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN), Training, General rules; New sequence, Prediction]

Contact definition used: Cβ-Cβ distance < 0.8 nm; sequence gap > 7 residues

The database of proteins used to train and test the contact map predictors.

Neural Network-based predictor
- 1 output neuron (contact/non-contact)
- 1 hidden layer with 8 neurons
- Input layer with 1071 input neurons:
  - Ordered residue pairs (1050 neurons)
  - Secondary structures (18 neurons)
  - Correlated mutations (1 neuron)
  - Sequence conservation (2 neurons)
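The layer sizes listed above can be put together in a small feed-forward sketch. Only the sizes (1071 inputs, 8 hidden neurons, 1 output) come from the slide; the sigmoid activation, the random weight initialisation and all names below are assumptions made for illustration, and no training is shown.

```python
import math
import random

# Input blocks as listed on the slide
N_PAIRS, N_SS, N_CORR, N_CONS = 1050, 18, 1, 2
N_INPUT = N_PAIRS + N_SS + N_CORR + N_CONS  # 1071
N_HIDDEN = 8

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """One weight row per neuron; the last entry of each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward_layer(inputs, weights):
    # zip stops at len(inputs), so row[-1] (the bias) is added separately
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
        for row in weights
    ]

def predict_contact(features, hidden_w, output_w):
    """Return the output neuron activation (between 0 and 1) for one residue pair."""
    if len(features) != N_INPUT:
        raise ValueError("expected %d input values" % N_INPUT)
    hidden = forward_layer(features, hidden_w)
    return forward_layer(hidden, output_w)[0]

rng = random.Random(0)
hidden_w = init_layer(N_INPUT, N_HIDDEN, rng)
output_w = init_layer(N_HIDDEN, 1, rng)

features = [0.0] * N_INPUT  # untrained toy input, for shape checking only
features[0] = 1.0
score = predict_contact(features, hidden_w, output_w)
```

With untrained random weights the score carries no meaning; the sketch only shows how one input vector per residue pair maps to a single contact/non-contact output.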

Representation of the input coding based on ordered couples.
(A) An alignment of 5 (hypothetical) sequences as represented in an HSSP file (Sander and Schneider, 1991). i and j stand for the positions of the two residues making or not making contact (A and D in the leading sequence, i.e. sequence 1).
(B) Single sequence coding. The position representing the couple (AD) in the vector is set to 1.0, while the other positions are set to 0.
(C) Multiple sequence coding. For each sequence in the alignment (1 to 5 in the scheme in A), the couple of residues in positions i and j is counted. The final input coding, representing the frequency of each couple in the alignment, is normalized to the number of sequences.
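The single-sequence and multiple-sequence codings described in (B) and (C) can be sketched as one Python function. The names and the toy alignment are hypothetical. One assumption is made about the vector layout: a couple and its symmetric (AD and DA) share the same position, which gives 210 positions for 20 amino acids; couples containing a gap are skipped.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed layout: one position per unordered couple, 20 * 21 / 2 = 210 in total
COUPLES = list(combinations_with_replacement(AMINO_ACIDS, 2))
COUPLE_INDEX = {couple: k for k, couple in enumerate(COUPLES)}

def couple_coding(alignment, i, j):
    """Frequency of each residue couple at columns i and j of the alignment.

    Counts are normalized to the number of sequences. With a single
    sequence, the vector has 1.0 at the position of its couple and 0
    elsewhere (single sequence coding).
    """
    vector = [0.0] * len(COUPLES)
    for seq in alignment:
        a, b = seq[i], seq[j]
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            vector[COUPLE_INDEX[tuple(sorted((a, b)))]] += 1.0
    return [v / len(alignment) for v in vector]

# Single sequence coding: couple (A, D) at positions 0 and 2
single = couple_coding(["ACD"], 0, 2)

# Multiple sequence coding on a toy alignment of 5 sequences
multi = couple_coding(["ACD", "ACE", "DCA", "ACD", "GCD"], 0, 2)
```

In the toy alignment the couple (A, D) occurs three times out of five (counting DA as AD), so its position in `multi` holds 0.6.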

Correlated mutations

Multiple sequence alignment (N sequences):
1 MVKGPGLYTDIGKKARDLLYKDYHSDKKFTISTYSPTGVAITSS
2 MVKGPGLYSDIGKRARDLLYRDYQSDHKFTLTTYTANGVAITST
3 MVKGPGLYTEIGKKARDLLYRDYQGDQKFSVTTYSSTGVAITTT

For two alignment positions i and j, all pairs of sequences are compared: M = N·(N-1)/2 couples (here the pairs 1-2, 1-3 and 2-3).
Position i: S(T;S), S(T;T), S(S;T)
Position j: S(I;L), S(I;V), S(L;V)
S: McLachlan substitution matrix
The scores form the M-valued vectors Vi and Vj; the correlated-mutation value is the correlation between Vi and Vj.
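The scheme above can be sketched in Python as a Pearson correlation between the two M-valued score vectors. Two assumptions are made and should be read as such: the McLachlan matrix is not reproduced here, so a hypothetical identity score (`toy_similarity`, 1 for identical residues and 0 otherwise) stands in for S, and the correlation is taken as the plain Pearson coefficient, returning 0 when one of the vectors has no variance.

```python
import math
from itertools import combinations

def toy_similarity(a, b):
    """Stand-in for the McLachlan substitution matrix (assumption)."""
    return 1.0 if a == b else 0.0

def correlated_mutation(alignment, i, j, similarity=toy_similarity):
    """Correlation between the substitution-score vectors of columns i and j.

    For each of the M = N*(N-1)/2 sequence pairs (k, l), the scores
    S(a_ik, a_il) and S(a_jk, a_jl) are collected into V_i and V_j.
    """
    pairs = list(combinations(range(len(alignment)), 2))
    if not pairs:
        raise ValueError("at least two sequences are needed")
    v_i = [similarity(alignment[k][i], alignment[l][i]) for k, l in pairs]
    v_j = [similarity(alignment[k][j], alignment[l][j]) for k, l in pairs]
    m = len(pairs)
    mean_i, mean_j = sum(v_i) / m, sum(v_j) / m
    cov = sum((x - mean_i) * (y - mean_j) for x, y in zip(v_i, v_j))
    var_i = sum((x - mean_i) ** 2 for x in v_i)
    var_j = sum((y - mean_j) ** 2 for y in v_j)
    if var_i == 0.0 or var_j == 0.0:
        return 0.0  # a fully conserved column carries no covariation signal
    return cov / math.sqrt(var_i * var_j)

# Toy alignments (hypothetical): columns 0 and 1 change together in the first
co_varying = correlated_mutation(["AK", "AK", "GE", "GE"], 0, 1)
independent = correlated_mutation(["AK", "AE", "GK", "GE"], 0, 1)
```

In the first toy alignment the two columns always mutate together, giving a correlation of 1; in the second, the columns vary independently of each other and the value drops to -0.5.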

The neural network architecture for prediction of contact maps

Accuracy of contact map prediction using a cross-validated data set (170 proteins)
[Figure: No. of proteins vs. accuracy]

T0087: 310 residues (A = 0.20, FR/NF)

T0106: 123 residues (A = 0.06, FR/NF)

T0128: 222 residues (A = 0.24, CM)

T0110: 128 residues (A = 0.30, FR)

T0125: 141 residues (A = 0.03, CM)

T0124: 242 residues (A = 0.01, NF)

T0115: 300 residues (A = 0.17, FR/NF); PDB code: 1FWK (Homoserine kinase, Methanococcus jannaschii)
[Figures: contact maps with N and C termini marked; axis: sequence position]

Predictive performance on 29 targets. Q3 = secondary structure prediction accuracy; Fr(H) and Fr(E) = frequency of predicted and observed alpha and beta structures in the chain; Lp = protein length in residues; Nal = number of sequences in the alignment; Xd and A are as defined in equations 3 and 1, respectively; Class is the classification of targets by prediction difficulty: CM = comparative modeling, FR = fold recognition, NF = new fold.

COMMENTS
- The predictor is trained mainly on globular mixed proteins
- Contacts among beta structures dominate
- Contacts in all-alpha proteins are more difficult to predict
- A filtering algorithm is needed
