IT og Sundhed 2010/11
Sequence-based predictors: secondary structure and surface accessibility
Bent Petersen, 13 January 2011
NetSurfP Real Value Solvent Accessibility predictions with amino acid associated reliability
Objective
Predict residues as being either buried or exposed (25 % threshold): two states/classes, Buried/Exposed
Predict the Relative Solvent Accessibility (RSA) as a “real” value
What is ASA?
Accessible Surface Area (solvent accessible), measured in Å²
The surface area accessible to a rolling water molecule
RSA
RSA = ACC / ASA
RSA = Relative Solvent Accessibility
ACC = Accessible area in the protein structure
ASA = Accessible Surface Area of the residue in Gly-X-Gly or Ala-X-Ala
Classification networks: Buried = RSA below 25 %, Exposed = RSA above 25 %
“Real” value networks: values 0 - 1, RSA > 1 set to 1
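The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming RSA is the observed accessibility (ACC) divided by the reference ASA of the same residue type in a Gly-X-Gly or Ala-X-Ala tripeptide. The function names, the reference value of 200.0 Å² in the usage note, and the handling of a residue sitting exactly on the 25 % boundary are assumptions for illustration, not taken from NetSurfP.

```python
def relative_solvent_accessibility(acc, reference_asa):
    """Return RSA = ACC / ASA, with values above 1 set to 1."""
    if reference_asa <= 0:
        raise ValueError("reference_asa must be positive")
    return min(acc / reference_asa, 1.0)


def two_state_class(rsa, threshold=0.25):
    """Label a residue as buried or exposed using the 25 % threshold.

    Assumption: a residue exactly at the threshold is called exposed.
    """
    return "buried" if rsa < threshold else "exposed"
```

For example, with a hypothetical reference ASA of 200.0 Å², an observed ACC of 30.0 Å² gives RSA = 0.15 and the class "buried", while an ACC of 300.0 Å² is capped at RSA = 1.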
Why predict RSA?
Residues exposed on the surface can be:
Involved in PTMs
Potential epitopes
Involved in protein-protein interactions
Relevant for the prediction of disease SNPs
How to start?
What do we want? We want to be able to predict the exposure of an amino acid.
What do we need? A training dataset and an independent evaluation dataset.
What information do we need? True structural information that the neural network can train on.
Where do we get that? PDB, DSSP.
Protein Data Bank, PDB Berman, H.M., et al., The Protein Data Bank. Nucl. Acids Res., (1): p
Define Secondary Structure of Proteins, DSSP
Kabsch, W. and C. Sander, Dictionary of Protein Secondary Structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, (12): p
==== Secondary Structure Definition by the program DSSP, updated CMBI version by ElmK / April 1,2000 ==== DATE=23-MAR
REFERENCE W. KABSCH AND C.SANDER, BIOPOLYMERS 22 (1983)
HEADER TOXIN 12-AUG-98 3BTA.
COMPND 2 MOLECULE: PROTEIN (BOTULINUM NEUROTOXIN TYPE A);.
SOURCE 2 ORGANISM_SCIENTIFIC: CLOSTRIDIUM BOTULINUM;.
AUTHOR R.C.STEVENS,D.B.LACY
TOTAL NUMBER OF RESIDUES, NUMBER OF CHAINS, NUMBER OF SS-BRIDGES(TOTAL,INTRACHAIN,INTERCHAIN)
ACCESSIBLE SURFACE OF PROTEIN (ANGSTROM**2)
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(J), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS IN PARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS IN ANTIPARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-5), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-4), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+2), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+3), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+4), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+5), SAME NUMBER PER 100 RESIDUES
*** HISTOGRAMS OF ***
RESIDUES PER ALPHA HELIX
PARALLEL BRIDGES PER LADDER
ANTIPARALLEL BRIDGES PER LADDER
LADDERS PER SHEET.
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
[DSSP per-residue records for the first residues of chain A (P F V N K Q F N Y K D P V N G V D I); numeric columns not recoverable]
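As a rough sketch of how the ACC column can be pulled out of a DSSP file for training, the snippet below reads the per-residue records that follow the "# RESIDUE" header line. The column offsets (chain at index 11, amino acid at 13, secondary structure at 16, ACC at 34 to 37) are an assumption based on the classic fixed-width DSSP layout and should be checked against the DSSP version in use. The function name and the returned dictionary keys are hypothetical.

```python
def parse_dssp_residues(lines):
    """Collect chain, amino acid, secondary structure and ACC per residue.

    Assumes the classic fixed-width DSSP layout; offsets are 0-based.
    """
    records = []
    in_table = False
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith("#") and "RESIDUE" in stripped:
            in_table = True  # residue records start after this header
            continue
        if not in_table or len(line) < 38:
            continue
        amino_acid = line[13]
        if amino_acid == "!":  # DSSP marks chain breaks with '!'
            continue
        records.append({
            "chain": line[11],
            "aa": amino_acid,
            "ss": line[16],  # blank means no assigned structure
            "acc": int(line[34:38]),
        })
    return records
```

The resulting ACC values can then be turned into RSA values and buried/exposed labels for each residue.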
Define Secondary Structure of Proteins, DSSP
DSSP defines 8 types of secondary structure:
G = 3-turn helix (3-10 helix)
H = 4-turn helix (α-helix)
I = 5-turn helix (π-helix)
T = Hydrogen bonded turn (3, 4 or 5 turn)
E = Extended strand
B = Residue in isolated β-bridge
S = Bend
Rest is C = coil
Required datasets
Training/test: used for optimization of settings using 10-fold cross-validation
Evaluation: used for final evaluation, less than 25 % sequence identity to the training/test dataset
10-fold Cross Validation
Break the dataset into 10 sets, each 1/10 of the data
Train on 9 sets and test on the remaining 1
Repeat 10 times and take the mean accuracy
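The procedure can be sketched as follows. This is a generic illustration, not the NetSurfP training code: `train_and_score` is a hypothetical callback that trains on one part and returns an accuracy on the other, and the interleaved assignment of items to folds is an arbitrary choice made here.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each fold is the test set exactly once."""
    folds = [items[i::k] for i in range(k)]  # k interleaved subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, train_and_score, k=10):
    """Return the mean score over the k train/test rounds."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)
```

For protein data the items would normally be whole sequences rather than single residues, so that residues from one chain never end up in both the training and the test part.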
Learning / Training dataset
Training set: Cull_1764
Max. Seq. ID: 25 %
Resolution: ≤ 2.0 Å
R-Factor: ≤ 0.2
Seq. Length AA
Including X-ray entries only
PISCES
Learning / Training dataset
Homology reduced against the evaluation set CB513 (302 sequences removed)
Final training set: 1764 sequences
‣ Buried: 44.20 % of amino acids
‣ Exposed: 55.80 % of amino acids
Learning / Training dataset
---Sequence/residue statistics---
Number of seq.: 1764
Longest seq.: 1T3T.A (1283)
Shortest seq.: 1YTV.M (6)
Number of amino acids:
---Assignment category statistics---
B ( 44.20%) A ( 55.80%)
---Amino acid statistics---
H ( 2.40%) G ( 7.59%) Y ( 3.57%) V ( 7.22%) E ( 6.64%)
S ( 5.84%) P ( 4.69%) A ( 8.53%) R ( 5.13%) Q ( 3.72%)
C 5202 ( 1.24%) K ( 5.52%) L ( 9.21%) N ( 4.25%) T ( 5.50%)
F ( 4.11%) D ( 5.92%) I ( 5.63%) W 6365 ( 1.52%) M 7353 ( 1.76%)
Evaluation dataset
Final evaluation dataset: CB513, 513 non-homologous sequences
Seq. length: 20 - 754 aa
84,119 amino acids
Buried: 44.19 % of amino acids
Exposed: 55.81 % of amino acids
Evaluation dataset
---Sequence/residue statistics---
Number of seq.: 513
Longest seq.: 6acn.all (754)
Shortest seq.: 1atpi-1 (20)
Number of amino acids: 84119 (sum of the per-residue counts below)
---Assignment category statistics---
B ( 44.19%) A ( 55.81%)
---Amino acid statistics---
R 3812 ( 4.53%) T 5015 ( 5.96%) D 4973 ( 5.91%) C 1381 ( 1.64%) Y 3065 ( 3.64%)
G 6657 ( 7.91%) N 3976 ( 4.73%) V 5795 ( 6.89%) I 4642 ( 5.52%) A 7267 ( 8.64%)
S 5222 ( 6.21%) K 4976 ( 5.92%) P 3903 ( 4.64%) E 5050 ( 6.00%) L 7134 ( 8.48%)
Q 3108 ( 3.69%) M 1710 ( 2.03%) H 1865 ( 2.22%) W 1236 ( 1.47%) F 3268 ( 3.88%)
X 19 ( 0.02%) B 31 ( 0.04%) Z 14 ( 0.02%)
Neural Network - Input
Position Specific Scoring Matrices, PSSM (iterative PSI-BLAST against nr70)
Columns: A R N D C Q E G H I L K M F P S T W Y V
Rows (class, residue, chain): B H 2BEM.A; A G 2BEM.A; A Y 2BEM.A; A V 2BEM.A; B E 2BEM.A
Secondary Structure predictions (sec predictor by Pernille Andersen)
Rows (class, residue, chain): B H 2BEM.A; A G 2BEM.A; A Y 2BEM.A; A V 2BEM.A; B E 2BEM.A
[Score values not recoverable]
Secondary structure predictor
Developed by Pernille Andersen, incorporated in NetSurfP
Trained on 2,085 sequences using DSSP
Three classes: H = H; E = E; C = ., G, I, B, S and T
Class distribution: H ~ 30 %, E ~ 20 %, C ~ 50 %
Performance of ~80 %
Maximum theoretical limit is ~88 %
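The reduction from the eight DSSP states to three classes can be written as a small lookup. This sketch follows the mapping stated on the slide (H stays H, E stays E, everything else becomes C); the function name is hypothetical, and treating a blank DSSP field the same as "." is an assumption.

```python
def reduce_to_three_states(dssp_states):
    """Map a string of DSSP states to the classes H, E and C.

    H -> H, E -> E; '.', G, I, B, S, T (and blanks) -> C.
    """
    return "".join(s if s in ("H", "E") else "C" for s in dssp_states)
```

Other reductions exist (some methods also count G as helix or B as strand), so reported three-state accuracies are only comparable when the same mapping is used.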
Neural Network - Settings
Window size:
Hidden units: 10, 20, 25, 30, 40, 50, 75, 150, (200)
Learning rate: 0.01 / (0.005)
Epochs (training rounds):
10-fold cross-validation: 9/10 used for training, 1/10 for testing
Neural network window
Sliding window over 2BEM.A (mol:aa, CHITIN-BINDING PROTEIN)
HGYVESPASRAYQCKLQLNTQCGSVQYEPQSVEGLKGFPQAGPADGHIASADKSTFFELDQQTPTRWNKLNLKTGPNSFT
WKLTARHSTTSWRYFITKPNWDASQPLTRASFDLTPFCQFNDGGAIPAAQVTHQCNIPADRSGSHVILAVWDIADTANAF
YQAIDVNLSK
BAAABBAAAAAAAABBBBABBABBAABBABAABABBBAABBBABBABAAAABBBBABAAABABBBAABABBABAABABAA
ABABBBBAABAAAAAAABBBABABBBAAABAABBBAAAAAABBBBBABBBABABABAABBABBBAAAAAAAAABBBBBAA
AAAAAABABB
Prediction on middle residue: Serine, buried
Neural network window, next position: prediction on middle residue Proline, exposed
Neural network window, next position: prediction on middle residue Alanine, exposed
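The sliding-window idea can be sketched as below: for every residue the network sees a fixed-width stretch of sequence centred on that residue and predicts the class of the middle one. The window width is left as a parameter because the value used in NetSurfP is not given here, and padding the sequence ends with "X" is an assumption made for illustration.

```python
def sliding_windows(sequence, window_size, pad="X"):
    """Yield (middle_residue, window) for each position in the sequence.

    The ends are padded so that the first and last residues also get a
    full-width window.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd to have a middle residue")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + window_size]
```

With the start of the 2BEM.A sequence, HGYVESPAS, and an illustrative width of 5, the window centred on the serine at position 6 is VESPA. Moving one step gives ESPAS, centred on proline, and the next step gives SPASX, centred on alanine.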
Method
Error function: Z-score:
Wisdom of the crowd
Selecting the best performing network architectures based on test performance
Better than choosing any single network
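A minimal sketch of this selection step is shown below. It assumes each trained network can be called on an input and returns a single number, that a test score is available per network, and that the selected networks are combined by a plain average of their outputs. The averaging rule and all names are assumptions for illustration; the source does not state how the outputs are combined.

```python
def ensemble_predict(networks, test_scores, x, top_n=20):
    """Average the outputs of the top_n networks ranked by test score."""
    if len(networks) != len(test_scores) or not networks:
        raise ValueError("need one test score per network")
    # Rank network indices by test performance, best first.
    ranked = sorted(range(len(networks)),
                    key=lambda i: test_scores[i], reverse=True)
    chosen = [networks[i] for i in ranked[:top_n]]
    outputs = [net(x) for net in chosen]
    return sum(outputs) / len(outputs)
```

Because selection uses the test-set performance, the final numbers still have to be reported on a separate evaluation set.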
Results - Classification networks
Training (columns: % Correct, MCC, # Networks)
Best Single Architecture
All Architectures
Top 20 Architectures
Results - Classification networks
Training (columns: % Correct, MCC, # Networks)
Best Single Architecture
All Architectures
Top 20 Architectures
Evaluation (columns: % Correct, MCC)
Dor and Zhou: 78.8, Not Published
NetSurfP CB500/CB
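The two measures reported for the classification networks, percent correct and the Matthews correlation coefficient (MCC), can be computed from label strings as sketched here. The labels follow the B (buried) / A (exposed) strings used elsewhere in these slides; counting A as the positive class and returning 0 when the MCC denominator is zero are choices made for this illustration.

```python
import math


def confusion_counts(true_labels, predicted_labels, positive="A"):
    """Return (tp, tn, fp, fn) for two equally long label strings."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def percent_correct(tp, tn, fp, fn):
    """Percentage of residues assigned to the right class."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def matthews_correlation(tp, tn, fp, fn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

MCC is useful alongside percent correct because it stays near 0 for a predictor that always outputs the majority class, even when the class split is uneven.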
Results Evaluation
NetSurfP /usr/cbs/bio/src/NetSurfP/NetSurfP -h
NetSurfP
NetDiseaseSNP
Disease-SNP prediction (Morten Bo Johansen)
Without NetSurfP: Cross-validation: MCC= Cross-Evaluation: MCC=
With NetSurfP: Cross-validation: MCC= Cross-Evaluation: MCC= 0.572
Paper is out..What then?
Statistics Submissions to the webserver from CBS website
As of 12 Jan sequences submitted from unique IPs
First citation 24 October 2009 :-)