Madhavi Ganapathiraju, Graduate student, Carnegie Mellon University. TMpro & Comparison of Algorithms for “Protein Stability Prediction Upon Mutations”
Overview: TMpro evaluations on PDBTM, TMPDB and MPtopo are complete. Additional inputs to TMpro are being studied: Yule values (not successful) and evolutionary profile (promising). The TMpro website has been completed. Evaluation of algorithms to predict protein stability changes upon mutations.
Part 1: TMpro
TMpro Evaluations
Method | Segment level: Qok, F score, Recall, Precision | Residue level: Q2 | Misclassified as soluble
MPtopo (101 TM proteins)
2a TMHMM: 66, 91, 89, 94, 84, 5
2b TMpro NN: 60, 93, 92, 79
PDBTM (191 TM proteins)
3a: 68, 90, 13
3b: 57, 81, 2
TMpro web-server is fully functional! Competition for TMpro logo. Prize: See your logo on the web!
Attempts to overcome confusion with globular soluble helices (1): Yule value features to be added. Yule value features that discriminate amino acid neighbor propensities between TM and non-TM helices were computed earlier. I tried to add these features as input to the NN predictor, but could not achieve quantitative improvement. I will discuss this in the future when I have results to present.
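As a rough illustration of what such a feature could look like, the sketch below computes Yule's Q association statistic from a 2x2 table of neighbor counts. It assumes that "Yule value" refers to Yule's Q; the function name and the way the four counts are defined are hypothetical and not taken from TMpro.

```python
def yules_q(n11, n10, n01, n00):
    """Yule's Q for a 2x2 contingency table, ranging from -1 to +1.

    Assumed meaning of the counts for an amino acid pair (a, b):
      n11: positions where a is present and b is a neighbor
      n10: positions where a is present and b is not a neighbor
      n01: positions where a is absent and b is a neighbor
      n00: positions where a is absent and b is not a neighbor
    """
    concordant = n11 * n00
    discordant = n10 * n01
    if concordant + discordant == 0:
        # No information in the table: treat as "no association".
        return 0.0
    return (concordant - discordant) / (concordant + discordant)
```

Computing Q separately on TM helices and on soluble helices gives two values per amino acid pair; a large difference between them marks a pair whose neighbor propensity could help discriminate the two helix types.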
Attempts to overcome confusion with globular soluble helices (2): Evolutionary profile information. It is known that knowledge of the evolutionary profile of a protein can improve prediction accuracy to a great extent. TMpro is capable of predicting TM segments without requiring knowledge of the profile, which is useful when you cannot extract sequence alignments from known proteins. But where the profile is known, we would like to use that additional information.
Profile generation. Those of you who have worked with evolutionary analysis before, please give feedback. Get multiple sequence alignments. Compute a position-specific scoring matrix (PSSM) for each protein: 21 rows (20 amino acids, and 1 row for gaps). A profile is generated for each protein in the training and test sets. PSSM(i,j) = log(C(i,j) / total counts at position j), or log(C(i,j) / unigram count of i in the protein).
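A minimal sketch of the profile computation, using the normalization by total counts at each position. The function name, the pseudocount (added so that log(0) never occurs) and the choice to count unknown symbols as gaps are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ROWS = AMINO_ACIDS + "-"  # 20 amino acids plus 1 row for gaps = 21 rows


def pssm_from_alignment(aligned_seqs, pseudocount=1.0):
    """Return a 21 x L matrix with PSSM[i][j] = log(C(i,j) / total counts at j).

    aligned_seqs: equal-length strings from a multiple sequence alignment.
    pseudocount: assumed smoothing value; must be > 0 to keep the log finite.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    row_index = {sym: i for i, sym in enumerate(ROWS)}
    counts = [[pseudocount] * length for _ in ROWS]
    for seq in aligned_seqs:
        for j, sym in enumerate(seq.upper()):
            # Assumption: symbols outside the 20 amino acids count as gaps.
            i = row_index.get(sym, row_index["-"])
            counts[i][j] += 1
    pssm = [[0.0] * length for _ in ROWS]
    for j in range(length):
        total = sum(counts[i][j] for i in range(len(ROWS)))
        for i in range(len(ROWS)):
            pssm[i][j] = math.log(counts[i][j] / total)
    return pssm
```

For example, `pssm_from_alignment(["AC-", "AC-", "AG-"])` returns 21 rows of 3 columns each. Swapping the denominator for the unigram count of residue i in the protein would give the second normalization.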
Doubts: What labels to assign to gaps?
We have labels for training sequences. But when the original sequence has gaps when aligned, how to interpret the labels of the gaps?

labels           --n------n----n------nnn-----n------n-----------------M-----
2a65       369   --D------E----L------KLS-----R------K-----------------H-----  377
2A65_A     369   --.------.----.------...-----.------.-----------------.-----  377
AAC07817   369   --.------.----.------...-----.------.-----------------.-----  377
YP_001956  364   --E------S----F------G.K-----.------.-----------------T-----  372

labels           -M------M------M------M-------M----------M---------MM-------
2a65       378   -A------V------L------W-------T----------A---------AI-------  385
2A65_A     378   -.------.------.------.-------.----------.---------..-------  385
AAC07817   378   -.------.------.------.-------.----------.---------..-------  385
YP_001956  373   -S------C------.-----------------------------------IL-------  377

Even TM regions have gaps, as shown above.
Doubts: What to do with missing segment info for some sequences?
When nothing is shown (gap/alignment) for some sequences, I am counting those as gaps.

XP_659910  47   L-......K.----------...KAP----RSNQV.-..FVAGTMGLASAVGA.AT  86
AAW43619   100  .....A..A-----------KNP----NTTRNV-..FMVGALGALGASSV.ST  136
CAB59195   59   ----.N.RP.-A..VIGSARFAYMAWTRVA  83
XP_466001  107  SKRA.-A.FVLSGGRFIYASLLRLL  130
AAA20832   103  SKRA.-A.FVLTGGRFVYASLVRLL  126
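One way to implement "counting those as gaps" is to pad every partial hit to the full alignment width before counting. The sketch below assumes the column offset of each hit within the alignment is known; the function name and its parameters are hypothetical.

```python
def pad_with_gaps(fragment, offset, width, gap="-"):
    """Place `fragment` at column `offset` of an alignment `width` columns wide.

    Every column the hit does not cover is filled with the gap symbol,
    so all rows end up the same length and can be counted column by column.
    """
    if offset < 0 or offset + len(fragment) > width:
        raise ValueError("fragment does not fit in the alignment")
    trailing = width - offset - len(fragment)
    return gap * offset + fragment + gap * trailing
```

For instance, `pad_with_gaps("SKRA", 3, 10)` gives `"---SKRA---"`. Rows padded this way feed directly into the gap row of the profile.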
Using profile for prediction. Studied independently of TMpro. Neural network with 21 input, 21 hidden and 1 output neurons. [Figure: predicted output and experimentally observed locations of TM helices (nonmembrane = 0, membrane = 1), plotted against residue number.]
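The 21-21-1 network can be sketched as a plain forward pass. This is only an illustration: it assumes each residue is presented as one 21-value profile column, that both layers use a sigmoid activation, and that weights are randomly initialised. Training is omitted, so the output of this untrained network carries no meaning.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def init_network(n_in=21, n_hidden=21, seed=0):
    """Random small weights for a n_in -> n_hidden -> 1 network (hypothetical init)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def predict_residue(column, net):
    """Map one 21-value profile column to a score between 0 and 1.

    Scores near 1 would be read as membrane, near 0 as nonmembrane.
    """
    w1, b1, w2, b2 = net
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, column)) + b)
        for row, b in zip(w1, b1)
    ]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
```

Calling `predict_residue` on each profile column in turn yields one score per residue, which is the kind of curve plotted against the experimentally observed TM locations.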
Another output
The NN architecture needs to be modified. But instead I post-processed the neural network output: computed its wavelet transform (Mexican hat wavelet, scale = 10).
Some more wavelet outputs. Note that these are from the training data itself; I have yet to check how it performs overall.
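The post-processing step can be sketched as a convolution of the per-residue network output with a Mexican hat wavelet at a single scale. The wavelet is written out by hand rather than taken from a library; the constant normalisation factor is dropped, and the kernel half-width of 5 x scale and the 1/sqrt(scale) scaling are assumptions.

```python
import math


def mexican_hat(t, scale):
    """Mexican hat (Ricker) wavelet at offset t, without its constant factor."""
    x = t / scale
    return (1.0 - x * x) * math.exp(-0.5 * x * x)


def mexican_hat_transform(signal, scale=10, half_width=None):
    """Wavelet transform of `signal` at one scale, same length as the input.

    Positions outside the sequence are treated as zero (assumed boundary rule).
    """
    if half_width is None:
        half_width = 5 * scale  # assumed cut-off; the wavelet is ~0 beyond it
    offsets = range(-half_width, half_width + 1)
    kernel = [mexican_hat(k, scale) for k in offsets]
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < n:
                acc += w * signal[j]
        out.append(acc / math.sqrt(scale))
    return out
```

Because the wavelet is positive within one scale of its centre and negative beyond, a helix-length stretch of high network output gives a clear positive peak, while isolated spikes are damped.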
Part 2: Stability upon Mutations
Evaluation of predictions of protein stability changes upon mutations. Effects of mutations on 2 TM proteins, rhodopsin and bacteriorhodopsin, are available in our group. The data show how much misfolding occurs and how the stability of the protein is affected. There are algorithms that can also predict these changes. We assessed how accurate or reliable the prediction methods are by comparing their results with our experimental data.
3 Prediction algorithms
I-Mutant2.0: Support vector machine. Features: amino acid neighbors in a 9 Å sphere, temperature, pH, relative solvent accessible surface area. http://gpcr2.biocomp.unibo.it/cgi/predictors/I-Mutant2.0/I-Mutant2.0.cgi
DFIRE: Knowledge-based statistical potentials. http://phyyz4.med.buffalo.edu/hzhou/mutation.html
FoldX: Statistical mechanics; accounts for various energy terms. http://fold-x.embl-heidelberg.de:1100/
Authors’ claims in 3 papers
Our results: Rhodopsin (PDB: 1U19); Bacteriorhodopsin (PDB: 1QM8)
Bias in # of mutations that increase/decrease stability. Database bias affects the apparent accuracies of algorithms. I-Mutant, for example, predicts a decrease in stability for a majority of the mutations. Whether the mutations studied through experiments preserve the natural bias toward stability-decreasing mutations therefore affects the apparent accuracy of the prediction algorithms.
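The effect of this bias can be seen with a toy calculation: a predictor that always answers "decrease" scores exactly the fraction of stability-decreasing mutations in the test set. The sketch assumes negative values mean decreased stability; the function names are hypothetical, and any numbers passed in for illustration are made up, not experimental data.

```python
def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_accuracy(predicted, experimental):
    """Fraction of mutations where predicted and experimental signs agree."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for p, e in zip(predicted, experimental) if sign(p) == sign(e))
    return hits / len(predicted)


def always_decrease(n):
    """A trivial predictor: every mutation is predicted to decrease stability."""
    return [-1.0] * n
```

On a made-up set with 8 destabilising and 2 stabilising mutations, `sign_accuracy(always_decrease(10), [-1.0] * 8 + [1.0] * 2)` is 0.8, while on a balanced made-up set it drops to 0.5, even though the predictor has not changed.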
Correlation with known data. Reported correlations for these methods are quite large (>0.7). On the data compared here, the correlations are quite low.
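For reference, a correlation of this kind can be computed as below. The sketch assumes the reported values are Pearson correlation coefficients between predicted and experimental stability changes, which may not match the measure used in every paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)
```

Here `xs` would hold the predicted changes for a set of mutations and `ys` the matching experimental values, in the same order.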
Notes: Local installations of blast and netblast are on Cologne: /usr1/blast-2.2.13/ and /usr1/netblast-2.2.13/. Java SDK on Cologne: /usr1/j2sdk1.4.2_11/
Acknowledgements: Judith Klein-Seetharaman; Christopher Jon Jursa, Pitt Information Sciences (for developing the web interface)