2014 Using machine learning to predict binding sites in proteins Jenelle Bray Stanford University October 10, 2014 #GHC
Protein Function
Proteins are biological molecules that:
−Catalyze metabolic reactions
−Replicate DNA
−Transport molecules
−Respond to stimuli
Protein Structure
Protein Binding Sites
UC Davis ChemWiki
Goal: Predict Where ATP Binds
Adenosine triphosphate (ATP) is the primary energy currency of the cell.
It transports chemical energy within cells for most reactions that require energy.
ATP Model Based on FEATURE
−Builds 3D models of the local environment around a protein site, given training sets
−Calculates chemical properties at varying radial distances from the site, and creates a vector containing the value of each property in each radial volume
−Constructs a Naïve Bayes model by comparing the distribution of feature vectors between positive and negative sites
Liang MP, Banatao DR, Klein TE, Brutlag DL, Altman RB. "WebFEATURE: an interactive web tool for identifying and visualizing functional sites on macromolecular structures." Nucleic Acids Res Jul 1;31(13):
Wei L, Altman RB. "Recognizing protein binding sites using statistical descriptions of their 3D environments." Pac Symp Biocomput. 1998:
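
The two steps on this slide (a per-shell property vector, then a Naïve Bayes comparison of positive and negative sites) can be sketched roughly as follows. This is an illustration only, not the FEATURE code: the function names, the shell width and count, the property names, the shared bin edges, and the add-one smoothing are all assumptions made for the example.

```python
import bisect
import math

def shell_feature_vector(site, atoms, properties, shell_width=1.25, n_shells=6):
    """Sum each property over the atoms in concentric shells around `site`.

    atoms: list of (xyz, {property_name: value}) pairs (hypothetical layout).
    Returns a flat list ordered shell by shell, property by property.
    """
    vector = [0.0] * (n_shells * len(properties))
    for xyz, props in atoms:
        shell = int(math.dist(site, xyz) // shell_width)
        if shell >= n_shells:
            continue  # atom lies outside the outermost shell
        for j, name in enumerate(properties):
            vector[shell * len(properties) + j] += props.get(name, 0.0)
    return vector

def _bin(value, edges):
    # Index of the histogram bin that `value` falls into.
    return bisect.bisect_right(edges, value)

def train_naive_bayes(pos_vectors, neg_vectors, edges):
    """Per-feature log-likelihood ratios log P(bin | site) / P(bin | non-site)."""
    n_bins = len(edges) + 1

    def frequencies(vectors, k):
        counts = [1.0] * n_bins  # add-one smoothing (assumption)
        for v in vectors:
            counts[_bin(v[k], edges)] += 1.0
        total = sum(counts)
        return [c / total for c in counts]

    table = []
    for k in range(len(pos_vectors[0])):
        p, q = frequencies(pos_vectors, k), frequencies(neg_vectors, k)
        table.append([math.log(a / b) for a, b in zip(p, q)])
    return table

def naive_bayes_score(vector, table, edges):
    # Features are treated as independent, so their log ratios simply add up.
    # The class prior is left out, so this is a log-likelihood ratio only.
    return sum(table[k][_bin(x, edges)] for k, x in enumerate(vector))
```

A positive score means the feature vector looks more like the positive training sites than the negative ones.
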
Extending the Use of FEATURE
So far, FEATURE has only been used to predict a protein functional site or a single ion binding site, never a whole small molecule (ligand).
We want to combine FEATURE models into one overall model that predicts ATP binding.
Training Set
−Positive set: all PDB entries (experimental 3D protein structure files) with ATP bound are clustered at 30% sequence similarity, and one protein from each cluster is used, giving 190 proteins
−Negative set: proteins with ligands that contain no part of ATP are selected, then also clustered at 30% similarity, giving 3345 proteins
−20% of the training data is left out for validation
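
As a rough sketch of the bookkeeping on this slide, the snippet below picks one representative per cluster and holds out 20% of the result. It assumes the 30% sequence-similarity clustering has already been done elsewhere and is supplied as a protein-to-cluster mapping. The function names, the "first protein seen" selection rule, and the fixed random seed are assumptions for illustration.

```python
import random

def pick_representatives(cluster_of):
    """Keep one protein per cluster (here: the first one encountered)."""
    chosen = {}
    for protein, cluster in cluster_of.items():
        chosen.setdefault(cluster, protein)
    return list(chosen.values())

def holdout_split(items, holdout_fraction=0.2, seed=0):
    """Shuffle reproducibly, then return (training, held_out)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

With the 190 positive proteins from the slide, a 20% holdout leaves 152 proteins for training and 38 for validation.
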
Combining Atomic Models
−Build individual FEATURE models for 3 atoms in each of the three sections of ATP
−The resulting 9 atomic models need to be combined into one overall molecular model
−Train a logistic regression model with the atomic FEATURE scores as features
ATP Docking
−Want to train the model on ATP poses that can actually fit in a binding pocket
−For positive proteins, calculate the FEATURE score for each of the 9 atoms of the ATP in the crystal structure
−For negative proteins, use AutoDock Vina to dock 1000 ATP poses into each protein
−Do this for a random sample of negatives equal in size to the number of positive proteins
Choosing ATP Poses for Training
For each negative protein, calculate the FEATURE scores of the nine atoms for all 1000 ATP poses, then choose the pose with the highest sum of (normalized) individual scores
−Ensures the model can distinguish good ATP poses in non-ATP-binding proteins from those in real ATP-binding proteins
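
The pose selection rule above could look something like the sketch below. The slide does not say how the scores were normalized, so this example assumes a z-score per atomic model, computed across the poses of a single protein. The function name and the input layout are likewise hypothetical.

```python
from statistics import mean, pstdev

def best_pose(pose_scores):
    """Return the index of the pose with the highest sum of normalized scores.

    pose_scores: one row per docked pose, one column per atomic FEATURE model.
    """
    columns = list(zip(*pose_scores))
    mus = [mean(col) for col in columns]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]

    def total(pose):
        return sum((s - m) / sd for s, m, sd in zip(pose, mus, sigmas))

    return max(range(len(pose_scores)), key=lambda i: total(pose_scores[i]))
```

Normalizing first keeps an atomic model with a wide score range from dominating the sum: a pose that scores well on every atom is preferred over one that scores very high on a single atom.
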
Logistic Regression Model
Build a logistic regression model from the 9 individual atomic FEATURE scores for each protein in the training set
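
A minimal sketch of a logistic regression over the 9 atomic scores, fitted with plain batch gradient descent. The slide does not describe the fitting procedure or any regularization, so the learning rate, epoch count, and function names here are assumptions. In practice a library implementation would be the more likely choice.

```python
import math

def sigmoid(z):
    # Split by sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Minimize the mean log loss; returns (weights, bias).

    X: one row per protein, 9 atomic FEATURE scores per row.
    y: 1 for ATP-binding proteins, 0 for negatives.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(x, w, b):
    """Estimated probability that a pose with atomic scores `x` is a true ATP site."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

The fitted weights show how much each atomic model contributes to the combined molecular score.
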
Model Validation
−Dock 1000 poses into all training proteins (positive and negative)
−Use the logistic regression model to score and rank every pose, and choose the highest-scoring pose for each protein
−Validation AUC = 0.83
−Compares favorably to docking energy (a physics-based model), with AUC = 0.74
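
For reference, the AUC reported here can be computed from one score per protein as the probability that a randomly chosen positive protein outscores a randomly chosen negative one. The sketch below takes each protein's score to be that of its highest-scoring pose, as on the slide. The function names and the toy numbers in the usage example are made up for illustration.

```python
def protein_score(pose_scores):
    """A protein is represented by its highest-scoring docked pose."""
    return max(pose_scores)

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy usage: two positive and two negative proteins, three poses each.
scores = [protein_score(p) for p in ([0.2, 0.9, 0.5], [0.4, 0.1, 0.3],
                                     [0.5, 0.2, 0.1], [0.1, 0.0, 0.05])]
auc = roc_auc([1, 1, 0, 0], scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which 0.83 and 0.74 are compared.
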
ATP Binding Prediction for a Protein Kinase
Acknowledgments
−Russ Altman for supporting the research
−Altman group
−LinkedIn for sending me to GHC