Predicting Protein Solvent Accessibility with Sequence, Evolutionary Information and Context-based Features
Ashraf Yaseen, Department of Mathematics & Computer Science, Central State University, Wilberforce, Ohio
Yaohang Li, Computer Science Department, Old Dominion University, Norfolk, Virginia
BIOT 2013: Biotechnology and Bioinformatics Symposium, 12/05/2013
Contents
- Introduction
- Research Objective
- Background
- Method
  - Protein data sets
  - Context-based features
  - Neural Network model
- Results
- Summary
Introduction
- The solvent-accessible surface area, or accessibility, of a residue is the surface area of the residue that is exposed to solvent.
- Residue accessibility is a useful indicator of the residue's location: on the surface or in the core.
[Figure: Surface area of a protein segment]
Introduction-cont.
- The DSSP program calculates the absolute solvent accessibility values of the residues in a protein.
- Relative values are calculated as the ratio between a residue's absolute solvent accessibility and its accessibility in an extended tripeptide (Ala-X-Ala) conformation. This allows the accessibility of different amino acids to be compared.
- A threshold of 0.25 defines the 2-state classification: a residue is exposed if its relative value is > 0.25, buried otherwise.
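The ratio and the 2-state rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the reference accessibilities in `MAX_ASA` are placeholder values standing in for the Ala-X-Ala tripeptide table, which the slides do not list.

```python
# Reference accessibilities (square angstroms) for residue X in an extended
# Ala-X-Ala tripeptide. ASSUMPTION: placeholder values for three amino acids
# only; substitute the reference table used by your own pipeline.
MAX_ASA = {"A": 110.0, "G": 85.0, "L": 165.0}

THRESHOLD = 0.25  # 2-state cutoff stated on the slide


def relative_accessibility(residue, absolute_asa, max_asa=MAX_ASA):
    """Return the absolute ASA divided by the residue's reference ASA."""
    return absolute_asa / max_asa[residue]


def two_state(residue, absolute_asa, threshold=THRESHOLD):
    """Return 'E' (exposed) if relative accessibility > threshold, else 'B'."""
    rsa = relative_accessibility(residue, absolute_asa)
    return "E" if rsa > threshold else "B"


if __name__ == "__main__":
    # Hypothetical absolute accessibilities, e.g. as reported by DSSP.
    for res, asa in [("A", 55.0), ("L", 16.5), ("G", 21.25)]:
        print(res, round(relative_accessibility(res, asa), 3), two_state(res, asa))
```

Note that the comparison is strict, so a residue whose relative value equals the threshold exactly is classified as buried, matching the "exposed if > 0.25" wording.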
Prediction effectiveness
- Residue solvent accessibility plays an important role in folding and in enhancing a protein's thermodynamic and mechanical stability.
- The burial of hydrophobic residues in the core is a major driving force for folding.
- The active sites of a protein are located on its surface.
- Accurate predictions reduce the conformational space, which aids modeling protein structures in three dimensions.
- They also help predict important protein functions.
Predicting Structural Features in Protein Modeling
- Correctly predicting structural features is a critical stepping stone toward obtaining correct 3D models.
[Figure: Protein modeling, from sequence through intermediate prediction steps to 3D structure]
Protein Structural Features
Properties of the residues in proteins, shown on protein 1BOO, chain A:
- Secondary structure: the general 3D form of local segments of residues
- Disulfide bond in the protein chain
- Surface area of a protein segment
Background
- Many methods exist, built on different protein data sets and different computational techniques: neural networks, support vector machines, nearest neighbor, information theory, and Bayesian statistics.
- The prediction is made in a discrete fashion.
- Using evolutionary information increases accuracy significantly.
  - 2-state prediction accuracy of ~75% with a 0.25 threshold
  - PSI-BLAST derived profiles: 2-state prediction accuracy of ~78%
Background-cont.
Structural feature prediction is a classification problem: each residue is predicted to be in one of a few states, using machine learning (ANN, SVM, HMM, ...).
- Secondary structure prediction
  - 3-state (helix, sheet, coil)
  - 8-state (α-helix, π-helix, 3₁₀-helix, β-strand, β-bridge, turn, bend and others)
- Residue solvent accessibility prediction
  - 2-state (buried or exposed)
- Disulfide bonding prediction
  - Stage 1: bonding state prediction (bonded/free)
  - Stage 2: connectivity prediction (connected/not connected)
[Figure: A predictor outputs the structural feature (state) of residue Ri]
Statement of the Problem
- Prediction methods improve when effective features are incorporated, such as multiple sequence alignments (MSA) in machine learning.
- The accuracy of current prediction methods has stagnated over the past few years:
  - 2-state solvent accessibility: ~78%
  - 3-state secondary structure: ~76-80%
  - 8-state secondary structure: ~68%
Statement of the Problem-cont.
- How can the accuracy of predicting protein structural features be continuously improved toward its theoretical upper bound?
- Reducing the inaccuracy of structural feature prediction will improve the efficiency of protein tertiary structure prediction, because the search space for finding a tertiary structure grows super-linearly with the fraction of inaccuracy in structural feature prediction.
Our Approach
- Extracting and selecting "good" features can significantly enhance prediction performance.
- Probably the most effective features for predicting the structural state of a residue are the structural states of its neighboring residues.
  - With the true states of the neighbors as features, accuracy exceeds 90%.
[Figure: Residue Ri and the structural states of its neighbors. Solvent accessibility: B = buried, E = exposed. Secondary structure: H = helix, E = sheet, C = coil.]
Our Approach-cont.
- Unfortunately, using the true structural states as features is not feasible, because they are unknown at prediction time.
- However, this suggests that the favorability of a residue adopting a certain structural state can also be an effective feature.
- Statistical scores measuring the favorability of a residue adopting a certain structural state within its amino acid environment can be evaluated from the experimentally determined protein structures in the Protein Data Bank (PDB).
Our Approach-cont.
[Figure: The input encoding feeds a predictor, which outputs the structural feature (state) of Ri]
- Input encoding: sequence & evolutionary information (MSA) + structure information (context-based scores)
- We expect our approach to improve the prediction of protein structural features, with the goal of achieving high accuracy levels.
Method
Context-based features
- Potential scores are calculated from context-based statistics derived from the protein data sets.
- They estimate the favorability of residues adopting specific structural states within their amino acid environment.
[Figure: Context-based Model]
Context-based Statistics & Potentials
[Figure: Residue Ri, its structural state Ci, and neighboring residues X and Y]
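The formulas on this slide did not survive, so the following is only one plausible way to turn context-based statistics into a potential, not the authors' definition. It assumes a pairwise log-odds form: for a center residue Ri in state Ci with a neighbor X at sequence offset k, the score is -ln( P(Ci | Ri, X, k) / P(Ci | Ri) ), estimated from counts over chains with known states. The function names, the window of two neighbors on each side, and the pseudocount are all hypothetical choices.

```python
import math
from collections import Counter


def build_counts(chains, max_offset=2):
    """Count (center, state) and (center, state, offset, neighbor) occurrences.

    chains: list of (sequence, states) string pairs of equal length.
    """
    single = Counter()   # (center residue, state)
    context = Counter()  # (center residue, state, offset, neighbor residue)
    for seq, states in chains:
        for i, (res, state) in enumerate(zip(seq, states)):
            single[(res, state)] += 1
            for k in range(-max_offset, max_offset + 1):
                j = i + k
                if k != 0 and 0 <= j < len(seq):
                    context[(res, state, k, seq[j])] += 1
    return single, context


def potential(single, context, res, state, k, neighbor,
              states="BE", pseudo=1.0):
    """Log-odds score; negative values mean the context favors `state`."""
    # P(state | res, neighbor at offset k), smoothed with a pseudocount.
    ctx_state = context[(res, state, k, neighbor)] + pseudo
    ctx_all = sum(context[(res, s, k, neighbor)] + pseudo for s in states)
    # P(state | res), the context-free reference.
    ref_state = single[(res, state)] + pseudo
    ref_all = sum(single[(res, s)] + pseudo for s in states)
    return -math.log((ctx_state / ctx_all) / (ref_state / ref_all))


if __name__ == "__main__":
    # Toy training data, not taken from the talk.
    chains = [("ALALGA", "BBBBEE"), ("GAGLAL", "EEEBBB")]
    single, context = build_counts(chains)
    print(potential(single, context, "A", "B", 1, "L"))
```

In the toy data, every buried alanine is followed by a leucine, so the score for that context comes out negative (favorable) under the sign convention assumed here. Scores of this kind, summed or listed per neighbor offset, could then serve as per-residue input features.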
Encoding & Neural Network Model
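The details of the encoding are not preserved in the slide text. A common scheme for per-residue predictors, shown here purely as an assumption, is a centered sliding window: each residue's input vector concatenates the profile (PSSM) row and the context-based scores of the residue and its neighbors, with zero padding beyond the chain ends. The function name and the default `half_window` are hypothetical.

```python
def encode_windows(pssm, scores, half_window=7):
    """Build one input vector per residue from a centered sliding window.

    pssm:   list of per-residue profile rows (e.g. 20 values each)
    scores: list of per-residue context-based score rows
    Positions beyond either chain end are zero-padded.
    """
    if len(pssm) != len(scores):
        raise ValueError("pssm and scores must cover the same residues")
    # Per-residue features: profile row followed by context-based scores.
    rows = [list(p) + list(s) for p, s in zip(pssm, scores)]
    width = len(rows[0]) if rows else 0
    pad = [0.0] * width
    encoded = []
    for i in range(len(rows)):
        vector = []
        for j in range(i - half_window, i + half_window + 1):
            vector.extend(rows[j] if 0 <= j < len(rows) else pad)
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    # Toy chain of 3 residues with 2 profile values and 1 score each.
    pssm = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]
    scores = [[-0.3], [0.0], [0.4]]
    for vec in encode_windows(pssm, scores, half_window=1):
        print(vec)
```

Each resulting vector has (2 * half_window + 1) * (profile width + score width) entries and would be the input to the neural network for the residue at the window center.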
Results
Table: Comparison of Q2 accuracy between our server (Casa) and other popularly used solvent accessibility prediction servers
Benchmarks: CASP9, Manesh215, Carugo338
Servers: NETASA, Sable (t=0.2), Sable (t=0.3), Netsurf, SPINE, ACCpro, Casa
Measures reported for each server: Q2, QB, QE

Table: Comparison of prediction performance of solvent accessibility using PSSM only and PSSM with context-based scores on Cull using 7-fold cross validation
             QB       QE       Q2
PSSM Only    78.44%   80.61%   79.50%
PSSM+Score   79.21%   82.00%   80.76%

QB and QE measure the quality of predicting the buried state and the exposed state, respectively.
Q2 = total number of residues correctly predicted / total number of residues
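The three measures can be computed directly from an observed and a predicted state string. The sketch below is illustrative only. Q2 follows the definition on the slide. For QB and QE the slide gives no formula, so the code assumes the usual per-state reading: the fraction of observed buried (or exposed) residues that are predicted correctly. Skipping positions whose observed state is neither B nor E (such as `.`) is a further assumption, and the function name is hypothetical.

```python
def q_scores(observed, predicted):
    """Return (Q2, QB, QE) as fractions for two equal-length state strings."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    total = {"B": 0, "E": 0}    # observed residues per state
    correct = {"B": 0, "E": 0}  # correctly predicted residues per state
    for obs, pred in zip(observed, predicted):
        if obs not in total:    # ASSUMPTION: skip undefined positions such as '.'
            continue
        total[obs] += 1
        if obs == pred:
            correct[obs] += 1
    n = total["B"] + total["E"]
    q2 = (correct["B"] + correct["E"]) / n if n else 0.0
    qb = correct["B"] / total["B"] if total["B"] else 0.0
    qe = correct["E"] / total["E"] if total["E"] else 0.0
    return q2, qb, qe


if __name__ == "__main__":
    # Toy strings, not data from the talk.
    observed = "EEB.BEBEEE"
    predicted = "EEB.BBBBEE"
    q2, qb, qe = q_scores(observed, predicted)
    print(f"Q2={q2:.3f} QB={qb:.3f} QE={qe:.3f}")
```

Multiplying the returned fractions by 100 gives percentages in the form used in the table.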
Results-cont.
Solvent accessibility prediction on protein 3NRF(A)
3NRF-A      DAVMVFARQGDKGSVSVGDKHFRTQAFKVRLVNAAKSEISLKNSCLVAQSAAGQSFRLDTVDEELTADTLKPGASVEGDAIFASEDDAVYGASLVRLSDRCK
DSSP SA2    EEB.BEBEEEEEEEEEEEEEEEEBBBBEBEBBBEBEEEBEBEEEBBBBBBEEEEEBEEEEEEEEBEEEEBEEEEEBEBEBEBBBEEEBBEEBBBBBBBEEEE
PSSM Only   EEB.BBBBEEEEBBBBEEEEEEBBBEBEBBBBEBEEEEBEBEEBBBBBBBEEEEEBEBEEBEEEBEEEBBEEEEEBEBBBBBBBEEEEBBEBEBBEBBEEBE
PSSM+Score  EEB.BBBEEEEEEEBEEEEEEBBBBEBEBBBBEEEEEEBEBEEBBBBBBBEEEEEBEBEEBEEEBEEEEBEEEEEBEBBBBBBBEEEBBBEBBBBEBBEEBE
Working with Casa
- Input a title.
- Input your sequence.
- Input your email address.
- Submit, then wait for the results...
"Casa" available at:
Working with Casa
- Check your email and click the link provided.
- The results are displayed.
Summary
- Our computational results, in N-fold cross validation as well as on benchmarks, demonstrate the effectiveness of context-based features: prediction accuracy improves for secondary structure, disulfide bonding, and solvent accessibility.
- Web servers implementing our prediction methods are currently available:
  - Dinosolve, available at:
  - C3-Scorpion, available at:
  - C8-Scorpion, available at:
  - Casa, available at:
Publications
1. Ashraf Yaseen and Yaohang Li, "Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features", Proceedings of Biotechnology and Bioinformatics Symposium (BIOT2012), Provo
2. Ashraf Yaseen and Yaohang Li, "Dinosolve: A Protein Disulfide Bonding Prediction Server using Context-based Features to Enhance Prediction Accuracy". Accepted, BMC Bioinformatics
3. Ashraf Yaseen and Yaohang Li, "Template-based Prediction of Protein 8-state Secondary structures". 3rd IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), New Orleans, April. Accepted, BMC Bioinformatics
4. Ashraf Yaseen and Yaohang Li, "Predicting Protein Solvent Accessibility with Sequence, Evolutionary Information and Context-based Features". Accepted at BIOT
5. Ashraf Yaseen and Yaohang Li, "Context-based features can enhance protein secondary structure prediction accuracy". Submitted to Bioinformatics
6. Ashraf Yaseen and Yaohang Li, "Accelerating Knowledge-based Energy Evaluation in Protein Structure Modeling with Graphics Processing Units", Journal of Parallel and Distributed Computing, 72(2), 2012
Acknowledgement
This work is partially supported by an NSF grant and an ODU SEECR grant.
Questions? Thank You