Prediction of RNA Binding Protein Using Machine Learning Technique
Avdesh Mishra, Reecha Khanal, Md Tamjidul Hoque
Date: 4/07/2018
Overview:
- Importance of RNA-Binding Proteins (RBPs)
- Dataset Collection
- Feature Extraction
- Feature Encoding Techniques
- Feature Ranking
- Machine Learning
- Results
- Conclusions
Why RNA Binding Protein Prediction?
RNA-binding proteins play important roles in many biological functions:
- mRNA stability
- Stress response
- Cell cycle
- Tumor differentiation
- Apoptosis
- Gene regulation at post-transcriptional levels
Dataset Preparation: Collection of validation dataset
- PISCES (sequence identity: 25%; X-ray resolution: 3Å): 14389 non-RBP protein chains
- UniProt database: 68084 RBPs
- Sequence length: 50 – 10,000 amino acids
CD-HIT:
- Input: 14389 protein chains and 68084 RBPs
- Sequence identity cutoff: >= 25%
- Protein length: 50 – 10,000 amino acids
- Output: 7077 non-RBPs and 2770 RBPs
Previously established dataset: 2780 RBPs, 7077 non-RBPs
We prepared a balanced dataset by taking a subset of the previously established dataset:
- Proteins with non-standard amino acids removed
- Redundancy removed
Final balanced dataset consists of:
- 1700 RBPs
- 1700 non-RBPs
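The filtering and balancing steps can be sketched as follows. This is a minimal illustration, not the pipeline used for the slides: the function names are hypothetical, random sampling is an assumption about how the subset was drawn, and redundancy removal (done with CD-HIT in the slides) is not reproduced here.

```python
import random

# The 20 standard amino acids (one-letter codes).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def is_clean(seq, min_len=50, max_len=10000):
    """True if seq has an allowed length and only standard residues."""
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA

def balanced_subset(rbps, non_rbps, n_per_class, seed=0):
    """Filter both classes, then draw n_per_class sequences from each.

    Random sampling is an assumption; the slides only state that a
    subset of the previously established dataset was taken.
    """
    rng = random.Random(seed)
    pos = [s for s in rbps if is_clean(s)]
    neg = [s for s in non_rbps if is_clean(s)]
    return rng.sample(pos, n_per_class), rng.sample(neg, n_per_class)
```

Calling `balanced_subset(rbps, non_rbps, 1700)` on the filtered sets would give the 1700/1700 split described above, provided each class retains at least 1700 sequences after filtering.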
Feature Set for RNA-Binding Protein Prediction
- Hydrophobicity: the property by which molecules repel water molecules.
- Polarity: separation of electric charge, leading to a molecule or chemical group having an electric dipole or multipole moment.
- Polarizability: a measure of how easily an electron cloud is distorted by an electric field.
- Van der Waals Volume: the volume occupied by an individual atom or molecule.
- SA (Solvent Accessibility): a measure of the surface area accessible to a solvent.
- SS (Secondary Structure): the predicted secondary structure of each amino acid residue.
- PSSM (Position Specific Scoring Matrix): evolutionary information obtained from sequence alignment, computed using PSI-BLAST.
Sequence and Feature Vector Encoding
Feature Set for RNA-Binding Protein Prediction (2518 features):
- Hydrophobicity, Polarity, Polarizability, Van der Waals Volume (21 features each): the 20 amino acids are divided into three groups, then C-T-D is applied.
- SA, Solvent Accessibility (13): probabilities of two residue states (buried, exposed) predicted using ACCPro, then C-T-D is applied.
- SS, Secondary Structure (21): probabilities of three secondary structure states (helix, beta, and coil) predicted using SSPro, then C-T-D is applied.
- PSSM (2400): features extracted by PSSM distance transformation (1900 PSSM-DDT + 100 PSSM-SDT + 400 PSSM-EDT).
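The feature counts on this slide can be checked with a short sketch. It assumes the usual C-T-D layout: one composition value per group, one transition value per unordered pair of groups, and five distribution values per group. The function name `ctd_dim` is hypothetical.

```python
def ctd_dim(n_groups, n_dist_points=5):
    """Number of C-T-D descriptors for a property with n_groups groups."""
    composition = n_groups
    transition = n_groups * (n_groups - 1) // 2  # unordered group pairs
    distribution = n_groups * n_dist_points      # first, 25%, 50%, 75%, 100%
    return composition + transition + distribution

physicochemical = 4 * ctd_dim(3)   # four properties, three groups each
solvent_access = ctd_dim(2)        # buried / exposed
secondary_struct = ctd_dim(3)      # helix / beta / coil
pssm_dt = 1900 + 100 + 400         # PSSM-DDT + PSSM-SDT + PSSM-EDT

total = physicochemical + solvent_access + secondary_struct + pssm_dt
print(total)  # 2518
```

Three groups give 3 + 3 + 15 = 21 descriptors and two groups give 2 + 1 + 10 = 13, which matches the counts listed for each feature type.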
C-T-D (Composition, Transition, and Distribution)
- Composition: the proportion of each amino acid group in the sequence
- Transition: the change of amino acids from one group to another as we move linearly through the sequence
- Distribution: how each amino acid group is distributed throughout the protein sequence
C-T-D (Composition, Transition, and Distribution): amino acid groups per property
Hydrophobicity
- Group 1 (Polar): R, K, E, D, Q, N
- Group 2 (Neutral): G, A, S, T, P, H, Y
- Group 3 (Hydrophobic): C, V, L, I, M, F, W
Normalized van der Waals Volume
- Group 1 (0 – 0.278): G, A, S, C, T, P, D
- Group 2 (2.95 – 4.0): N, V, E, Q, I, L
- Group 3 (4.43 – 8.08): M, H, K, F, R, Y, W
Polarity
- Group 1 (4.92 – 6.2): L, I, F, W, C, M, V, Y
- Group 2 (8.0 – 9.2): P, A, T, S
- Group 3 (10.4 – 13.0): H, Q, R, K, N, E, D
Polarizability
- Group 1 (0 – 0.108): G, A, S, D, T
- Group 2 (0.128 – 0.186): C, P, N, V, E, Q, I, L
- Group 3 (0.219 – 0.409): K, M, H, F, R, Y, W
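Before C-T-D is applied, each residue is replaced by the label of its group. A minimal sketch for the hydrophobicity grouping listed on this slide follows; the function name, the label characters "1", "2", "3", and the example peptide are illustrative assumptions.

```python
# Hydrophobicity groups as listed on the slide.
HYDROPHOBICITY_GROUPS = {
    "1": "RKEDQN",   # polar
    "2": "GASTPHY",  # neutral
    "3": "CVLIMFW",  # hydrophobic
}

def encode_groups(sequence, groups):
    """Replace each amino acid with the label of the group it belongs to."""
    lookup = {aa: label for label, members in groups.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence)

# Hypothetical 8-residue peptide, used only to show the encoding.
print(encode_groups("MKVLAAGR", HYDROPHOBICITY_GROUPS))  # 31332221
```

The same function works for the other three properties by passing a different group dictionary.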
Example: A E AAA E A EE AAAAA E A EEE AA EE A EEE AA E (30 residues)
- Number of A's (n1) = 16; number of E's (n2) = 14
- Composition for n1 = 16/30; composition for n2 = 14/30
- Transition: (15/29 × 100), since there are 15 transitions from A to E or E to A
- Distribution for A: the first, 25%, 50%, 75%, and 100% of the A's fall at positions 1, 5, 12, 20, and 29; each position is expressed as a percentage of the sequence length (for example, 1/30 × 100 for the first A)
Applied to a property with three groups, this approach yields a 21-dimensional vector.
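The example can be reproduced with a short sketch. The function `ctd` is a hypothetical helper, and the rule used to pick the residue marking each percentage (floor of the fraction times the group count, with a minimum of 1) is an assumed convention. All values are reported as percentages.

```python
import math

def ctd(symbols, alphabet):
    """Composition, transition, and distribution descriptors (percentages)."""
    n = len(symbols)
    # Composition: share of each group in the sequence.
    composition = [100.0 * symbols.count(g) / n for g in alphabet]
    # Transition: share of adjacent pairs that switch between two given groups.
    transition = []
    for i, a in enumerate(alphabet):
        for b in alphabet[i + 1:]:
            changes = sum(1 for x, y in zip(symbols, symbols[1:]) if {x, y} == {a, b})
            transition.append(100.0 * changes / (n - 1))
    # Distribution: position of the first, 25%, 50%, 75%, and 100% occurrence.
    distribution = []
    for g in alphabet:
        positions = [i + 1 for i, s in enumerate(symbols) if s == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            k = max(1, math.floor(frac * len(positions)))
            distribution.append(100.0 * positions[k - 1] / n)
    return composition, transition, distribution

seq = "A E AAA E A EE AAAAA E A EEE AA EE A EEE AA E".replace(" ", "")
comp, trans, dist = ctd(seq, "AE")
print(len(seq), comp, trans, dist[:5])
```

With two groups the function returns 2 + 1 + 10 = 13 values; with the three-group encodings it returns 21.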
PSSM-DT (Position Specific Scoring Matrix Distance Transformation)
PSSM-DDT: measures the occurrence probabilities of pairs of different amino acids separated by a distance d in a protein, computed from the PSSM profile.
- d: distance between the two amino acids
- i1, i2: the two different amino acids
- L: length of the protein sequence
PSSM-DT (Position Specific Scoring Matrix Distance Transformation)
PSSM-SDT: measures the occurrence probabilities of pairs of the same amino acid separated by a distance d in a protein, computed from the PSSM profile.
- d: distance between the two amino acids
- i: the individual amino acid
- L: length of the protein sequence
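A minimal sketch of PSSM-SDT and PSSM-DDT follows, using one common formulation that is assumed here rather than taken from the slides: for each distance d, average the product of the PSSM scores at positions j and j + d over the L − d valid positions. The PSSM is assumed to be a list of L rows with one column per amino acid. A maximum distance of 5 is an inference from the feature counts (20 × 5 = 100 and 380 × 5 = 1900), not a stated parameter.

```python
def pssm_sdt(pssm, max_d):
    """Same-amino-acid distance transformation: n_aa * max_d features."""
    L, n_aa = len(pssm), len(pssm[0])
    feats = []
    for d in range(1, max_d + 1):
        for i in range(n_aa):
            total = sum(pssm[j][i] * pssm[j + d][i] for j in range(L - d))
            feats.append(total / (L - d))
    return feats

def pssm_ddt(pssm, max_d):
    """Different-amino-acid distance transformation: n_aa * (n_aa - 1) * max_d features."""
    L, n_aa = len(pssm), len(pssm[0])
    feats = []
    for d in range(1, max_d + 1):
        for i1 in range(n_aa):
            for i2 in range(n_aa):
                if i1 == i2:
                    continue  # identical pairs belong to PSSM-SDT
                total = sum(pssm[j][i1] * pssm[j + d][i2] for j in range(L - d))
                feats.append(total / (L - d))
    return feats
```

For a 20-column PSSM, `pssm_sdt(pssm, 5)` returns 100 values and `pssm_ddt(pssm, 5)` returns 1900, in line with the counts quoted for the feature set. The sequence must be longer than `max_d` residues.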
PSSM-DT (Position Specific Scoring Matrix Distance Transformation)
PSSM-EDT: measures the non-co-occurrence probability of two amino acids separated by a certain distance d in a protein, computed from the PSSM profile.
- d: distance between the two amino acids
- Ax, Ay: the two amino acids
- L: length of the protein sequence
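A minimal sketch of PSSM-EDT follows, under an assumed formulation that may differ from the one behind the slides: for each ordered pair (Ax, Ay), sum over distances d the mean squared difference between the score of Ax at position i and the score of Ay at position i + d. The maximum distance is left as a parameter because the slides do not state it.

```python
def pssm_edt(pssm, max_d):
    """Evolutionary difference transformation: n_aa * n_aa features."""
    L, n_aa = len(pssm), len(pssm[0])
    feats = []
    for x in range(n_aa):
        for y in range(n_aa):
            value = 0.0
            for d in range(1, max_d + 1):
                sq = sum((pssm[i][x] - pssm[i + d][y]) ** 2 for i in range(L - d))
                value += sq / (L - d)  # mean squared difference at distance d
            feats.append(value)
    return feats
```

Because the distances are summed rather than kept separate, a 20-column PSSM always yields 20 × 20 = 400 features, matching the PSSM-EDT count in the feature set.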
Feature Ranking: Minimum redundancy maximum relevance (mRMR) feature selection technique
Final training set of 1001 features obtained
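The idea behind mRMR can be sketched with a greedy selection loop. This is a simplified stand-in, not the implementation used for the slides: it scores relevance and redundancy with absolute Pearson correlation, whereas mRMR is usually defined with mutual information. The function names are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5

def mrmr(columns, labels, k):
    """Greedily pick k feature indices with high relevance and low redundancy.

    columns is a list of feature columns (one list of values per feature).
    """
    relevance = [abs(pearson(c, labels)) for c in columns]
    # Start with the single most relevant feature.
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < k:
        best, best_score = None, float("-inf")
        for j in range(len(columns)):
            if j in selected:
                continue
            # Redundancy: mean absolute correlation with already-selected features.
            redundancy = sum(abs(pearson(columns[j], columns[s])) for s in selected) / len(selected)
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

A feature that duplicates one already selected gets a redundancy of 1 and is pushed down the ranking, even if it is highly relevant on its own. In the setting above, `k` would be 1001.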
Machine Learning Approach:
- Training features for base classifiers: X = {f1, f2, f3, …, f2518}
- Base classifiers: KNN, GBC, LOGREG
- Training features for meta-classifier: X = {PKNN-bind, PKNN-non-bind, PGBC-bind, PGBC-non-bind, PLOGREG-bind, PLOGREG-non-bind, f1, f2, f3, …, f2518}
- Meta-classifier: SVM
- Output: StackRBPPrediction
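The meta-classifier input can be sketched as a simple concatenation of the base-classifier probabilities with the original features. The function name and the probability values below are hypothetical; in practice the probabilities would come from the trained KNN, GBC, and LOGREG models.

```python
BASE_CLASSIFIERS = ("KNN", "GBC", "LOGREG")

def meta_features(base_probs, features):
    """Build the meta-classifier input for one protein.

    base_probs maps each base classifier name to (P_bind, P_non_bind);
    features is the original feature vector f1..fN.
    """
    meta = []
    for name in BASE_CLASSIFIERS:
        p_bind, p_non_bind = base_probs[name]
        meta.extend([p_bind, p_non_bind])
    return meta + list(features)

# Hypothetical probabilities and a toy 3-feature vector.
example = meta_features(
    {"KNN": (0.8, 0.2), "GBC": (0.7, 0.3), "LOGREG": (0.6, 0.4)},
    [0.1, 0.5, 0.9],
)
print(len(example))  # 9
```

With the full 2518-feature vector, the SVM meta-classifier receives 6 + 2518 = 2524 inputs per protein.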
Results: Accuracy, sensitivity, and specificity were calculated using 10-fold cross-validation.
Model: Prediction of RNA-binding protein using stacking
Overall accuracy (ACC): 91.24%
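The three metrics follow from the confusion matrix in the standard way. The sketch below uses made-up counts for illustration only; they are not results from this work.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall on RBPs), and specificity (recall on non-RBPs)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical counts: 90 RBPs found, 10 missed, 80 non-RBPs rejected, 20 false alarms.
acc, sens, spec = binary_metrics(tp=90, fp=20, tn=80, fn=10)
print(acc, sens, spec)  # 0.85 0.9 0.8
```

Under 10-fold cross-validation, the counts would be accumulated over the ten held-out folds before the metrics are computed.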
Comparing different machine learning approaches (accuracy):
- Logistic Regression (LOGREG): 88.52%
- Support Vector Machine (SVM): 90.53%
- Stacking: 91.24%
Fig: Comparison with a recently proposed similar predictor (RBPPred), accuracy:
- RBPPred: 67.82%
- Our predictor: 91.25%
Thank you for your attention.